Wednesday 17 January 2018

Populating SQL Server from Apache NiFi

Apache NiFi continues to amaze with its capabilities.

The only issue is the sometimes slightly impenetrable documentation, which doesn't always join the dots. A good example is the processor called ConvertJSONToSQL. This processor converts simple JSON into an SQL INSERT statement, allowing a relational database to be populated with the JSON data. To make this work, there are a number of other things that need to be right first. In the interests of giving something back, I'll describe below what I had to do to make it all work.

The first thing to do is make sure the JSON you have has a simple structure. By simple, I mean that each JSON document contains only top-level name-value pairs, with no nested fields. This makes sense when you consider that you are trying to populate a single table with named values, so there needs to be a one-to-one correspondence between the JSON fields and the SQL table columns.
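For illustration, here is the sort of flat document that works (the field names here are invented for the example):

    {
      "id": 42,
      "first_name": "Jane",
      "last_name": "Smith",
      "signup_date": "2018-01-17"
    }

Something like {"name": {"first": "Jane"}}, by contrast, would not map cleanly, because a nested object has no single column to land in.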

Next, make sure you have actually created the SQL table, and remember its name as well as the schema and database you created it under. Make sure the names of the columns in your table match the names of the JSON fields exactly. This last point is important.
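As a sketch, a table matching the example JSON above might be created like this (the table name and column types are placeholders; use your own):

    CREATE TABLE dbo.customers (
        id           INT PRIMARY KEY,   -- matches the "id" JSON field
        first_name   NVARCHAR(100),     -- matches "first_name"
        last_name    NVARCHAR(100),     -- matches "last_name"
        signup_date  DATE               -- matches "signup_date"
    );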

You must also download the correct JDBC driver from Microsoft. The download location changes often, so the best approach is to search for "Microsoft JDBC Driver 4.2 for SQL Server". You will eventually find a file called something like "sqljdbc_6.2.2.0_enu.tar.gz". Buried in this archive will be another file called "sqljdbc42.jar". This is the one you want, and you must place it in a location that can be seen by NiFi. I happen to be running NiFi in a Docker environment and, for simplicity's sake, I put the file in "/tmp". Obviously, your setup might be different and you may have to go off-piste here. Note that more modern versions of the JDBC driver are available; if you want, you can download and use those instead.
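For what it's worth, extracting and copying the jar went something like this for me (the path inside the archive varies between driver versions, so treat this as a sketch):

    tar -xzf sqljdbc_6.2.2.0_enu.tar.gz
    # locate the JDBC 4.2 jar in the extracted tree and copy it somewhere NiFi can see
    find . -name "sqljdbc42.jar" -exec cp {} /tmp/ \;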

Now you must create a thing called a DBCPConnectionPool. This controller service defines where your database is and how to access it, and the ConvertJSONToSQL processor refers to it. The easiest way to create it is to edit the JDBC Connection Pool property of the ConvertJSONToSQL processor and follow the dialog to create a new connection pool. You'll know you are in the right place when you find yourself editing a controller service. The critical parameters are:

  • The Database Connection URL: set this to something like jdbc:sqlserver://yourserveraddress (see the fuller example after this list)
  • The Database Driver Class Name: set this to com.microsoft.sqlserver.jdbc.SQLServerDriver
  • The Database Driver Location(s): set this to the location where you saved the sqljdbc driver file
  • The Database User: set this to a suitable user
  • Password: set this to the correct password. (Note that the password is not saved by default when saving templates, so you have to re-enter it if you import a template from elsewhere. This behaviour can be changed, but that's a subject for another day.)
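
As a concrete example, a connection URL that also names the target database might look like this (the server and database names here are invented; substitute your own):

    jdbc:sqlserver://yourserveraddress:1433;databaseName=SensorData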
Once configured, the controller service must be enabled by clicking the little lightning icon. If the state of the service goes to Enabled, all is well. If it sticks at Enabling, it usually means the driver cannot be found.

Now we can get on with creating the SQL from the JSON. Connect the flow containing the JSON to the ConvertJSONToSQL processor. The critical parameters are:
  • JDBC Connection Pool: set this to the connection pool created above
  • Statement Type: set this to INSERT
  • Table Name: set this to the name of the table you created above
  • Catalog Name: set this to the name of the database that you created the table in
  • Schema Name: set this to the name of the schema under which you created the table
  • Translate Field Names: set this to false
The other properties are more complex: they depend on which fields you have and on what you want to happen if a field is missing from the source or from the target.

Now you can run the ConvertJSONToSQL processor, and you should see output from the sql relationship. If you examine the flow files that are created, you should see SQL INSERT statements; if you do, then all is well. At first sight the contents of the flow files seem to be missing the data values, but fear not: the values are carried in attributes associated with each flow file. Note also that the input JSON is split into multiple SQL flow files, one per JSON document. This might look like a performance problem, but it is dealt with later by the PutSQL processor, which is capable of batching up multiple SQL inserts. If you get errors at this point, it is often because field names are mismatched. Be aware that the processor queries the database directly for the column names in order to perform this checking.
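To give a flavour of what to expect (a sketch based on the hypothetical table above), a generated flow file might contain something like:

    INSERT INTO customers (id, first_name, last_name, signup_date) VALUES (?, ?, ?, ?)

with the actual values carried in attributes such as sql.args.1.value and sql.args.1.type, which PutSQL later uses to fill in the placeholders.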

The final step is to use the PutSQL processor and pass the SQL flow files to it. Its JDBC Connection Pool property must be set to the connection pool created above. Run this processor and you should see data being inserted into your database.
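To check that rows really are arriving, a quick query against the (hypothetical) table from earlier does the job:

    SELECT COUNT(*) FROM dbo.customers;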

So, in summary, it's quite an involved process but I hope this has helped you get there a bit more quickly.
