
Saturday 8 December 2018

Seeing how generated attributes are constructed

Sometimes, a "brute force feature generation and selection-athon" is irresistible.

I had a feeling that some data I was looking at contained hidden relationships between attributes that could have yielded improved prediction performance. My gut feel was that dividing one attribute by another, or perhaps taking the log of one and adding it to the reciprocal of another, might give a new attribute with more predictive power. How could I do this without tedious manual intervention that would have been boring, might have missed some permutation, and could have introduced mistakes?

There are a number of ways of doing this in RapidMiner. One approach uses one of the iterating operators, collectively known as YAGGA, to perform an evolutionary search. Each iteration generates new attributes by combining existing attributes using simple functions. The performance is assessed and attributes that don't lead to an improvement are eliminated whilst those that do are retained to allow them to generate yet more attributes. This process repeats until the desired stopping conditions have been reached.
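To make the idea concrete, here is a rough sketch in R of the brute-force version of this search. It is not RapidMiner's YAGGA, and the data frame "mydata" with its "label" column is a hypothetical stand-in: derive candidate attributes from the existing ones, then keep a candidate only if it improves a simple holdout performance estimate. The evolutionary version adds selection pressure across generations, but the flavour is the same.

  # Not RapidMiner's YAGGA, just the underlying idea in R: build candidate
  # attributes from the existing ones and keep only those that improve a
  # simple holdout accuracy estimate.
  library(rpart)  # a basic decision tree stands in for the real learner

  holdout_accuracy <- function(df, label) {
    # assumes the label column is a factor, so rpart builds a classification tree
    train <- sample(nrow(df), 0.7 * nrow(df))
    fit   <- rpart(reformulate(".", label), data = df[train, ])
    pred  <- predict(fit, df[-train, ], type = "class")
    mean(pred == df[[label]][-train])
  }

  generate_candidates <- function(df, predictors) {
    cands <- list()
    for (a in predictors) {
      cands[[paste0("log_", a)]] <- log(abs(df[[a]]) + 1e-9)
      for (b in setdiff(predictors, a)) {
        cands[[paste0(a, "_over_", b)]] <- df[[a]] / (df[[b]] + 1e-9)
      }
    }
    as.data.frame(cands)
  }

  # baseline   <- holdout_accuracy(mydata, "label")
  # candidates <- generate_candidates(mydata, setdiff(names(mydata), "label"))
  # keep <- Filter(function(nm) {
  #   holdout_accuracy(cbind(mydata, candidates[nm]), "label") > baseline
  # }, names(candidates))
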

For the masochist, there is a lower-level operator called "Generate Function Set" that allows finer control over the operation. I adopted this because I wanted to look in detail at the attributes that were leading to improvements and equally see those that led nowhere.

So I made a process. But then I got stuck because I found there was no way, in the RapidMiner Studio GUI, to see what construction had been applied to generate new attributes. A bit of background: when RapidMiner generates new attributes, they show up with names of the form "gensymxxx". In the old days, there was a way of seeing the attribute construction from one of the viewing panes. Alas, it's not there anymore.

Luckily, there is an operator called "Write Constructions". This takes an example set and writes a file containing the details of how each attribute was constructed. A bit laborious, but workable.

Did I find a new attribute that made an improvement? Yes, I did. It was a small improvement but enough to be interesting. The improvement is the sort of thing that would get you from the middle of the leaderboard to being a contender in a Kaggle competition.


Thursday 25 January 2018

Keras + RapidMiner + digit recognition = 97% accuracy

I've successfully created a process using RapidMiner and Keras to recognise the MNIST handwritten digits with a headline accuracy of 97% on unseen data.

You can download the process here.

It requires R and Keras to be installed - an exercise for the reader.

The main features of the process are

  • R is used to get the MNIST data and create a training set and a test set. I use R to add the label to the data to make a single example set for the two cases rather than have separate structures for the data and the labels. This is a big strength of RapidMiner because all the data and labels are in one place. 
  • The data is restructured in R to change 3d tensors of shape (60000, 28, 28) into 2d tensors of shape (60000, 784). The 3d tensor represents the images, each of size 28 by 28 pixels. RapidMiner example sets are 2d tensors, and these are fine to feed into the Keras part of the process.
  • The Keras part of the model has the following characteristics
    • The input shape is (784,), which matches the number of columns in the 2d tensor.
    • The loss parameter is set to "categorical_crossentropy" and the optimizer is set to "RMSprop".
    • There are 2 layers in the Keras model. The first is "dense" with 512 units and activation set to "relu". The second is "dense" with 10 units and activation set to "softmax". The 10 in this case is the number of different values the label can take.
  • The "validation_split" parameter is set to 0.1 so that a loss is calculated on a small part of the training data. This leads to validation loss results in the output which is used to see when over-fitting is happening.
Here is a screenshot of the history from a large run (this is output from the Keras model as an example set).



The training loss (in blue) decreases steadily as the model learns the training data more and more thoroughly. The loss against the validation data (in red) gets worse as the number of epochs increases, and the variation between epochs suggests that perhaps I should use a larger training fraction. Nonetheless, only a small number of epochs would be enough to get a model that performs well on unseen data.

The Keras model does not use convolution layers (an exercise for a later post) but despite this, it performs very well. Here is the confusion matrix using 3 epochs.


This is a very good result and shows the power of deep learning. It's gratifying that RapidMiner supports it.

As time permits, a future post will look at using convolution layers to see what improvements could be achieved. I may also do some systematic experiments to check how validation loss measured during training maps to actual loss on unseen data.

Wednesday 24 January 2018

Visualising the MNIST numbers data

Keras comes with some built-in functions to obtain the MNIST dataset, which is derived from data collected by the National Institute of Standards and Technology. As far as I can tell, it's not possible to access these directly from within RapidMiner, but never fear: here is a process that can do it.

It uses R and obviously requires Keras to have been installed. I'll leave that to the reader to get right.

The process also chooses one of the digits and casts it into a form that allows it to be displayed. It does this using the "Windowing" operator followed by "De-Pivot" to transform the matrix-like data into x,y,z tuples.
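For comparison, here is what that transformation looks like in plain R. This is a sketch of the same idea, not the RapidMiner process itself; the melt into x,y,z tuples is done by hand rather than with Windowing and De-Pivot.

  # Fetch MNIST via the keras R package and melt one 28x28 digit into
  # (x, y, z) tuples suitable for plotting.
  library(keras)

  mnist <- dataset_mnist()
  digit <- mnist$train$x[6, , ]   # the 6th training image as a 28x28 matrix

  # as.vector() reads the matrix column by column, so the row index varies fastest
  xyz <- data.frame(
    x = rep(1:28, each  = 28),    # image column
    y = rep(1:28, times = 28),    # image row
    z = as.vector(digit)          # pixel intensity, 0-255
  )

  # A quick look; reversing y puts row 1 at the top so the digit is upright.
  # library(ggplot2)
  # ggplot(xyz, aes(x, 29 - y, fill = z)) + geom_tile()
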

Here's the 6th digit displayed using a block chart. This looks like a 2.



I've already used R with Keras to create a classifier that can recognise these digits. This is my first step towards using Keras in RapidMiner to build a classifier to do the same job.

Visualising discrete wavelet transforms: updated for RapidMiner v8

I revisited a previous post about visualising discrete wavelet transforms because I wanted to remember how I did something. The process is quite old and did not work first time with version 8 of RapidMiner Studio. There have been some subtle changes with respect to the requirements for the type of attributes for the "Join" and "De-Pivot" operators. Never fear, I've updated the process and it's here.

Here is the money shot to prove it still works.

An interesting feature of this process is the way it uses the "De-Pivot" operator to transform a matrix-like example set into x,y,z coordinates that can be plotted.

Thursday 18 January 2018

Genetic Algorithms with R and Shiny

Following on from this post, here's an application that uses a genetic algorithm to find the maximum of a complex function as the inputs to it are varied.

I've implemented this in R and Shiny and the application is hosted on shinyapps.io here. The application lets a user try to beat the computer and find the optimum by brute force alone. Needless to say, there's no chance of a human managing this in any meaningful time.

As with the previous post, the Rastrigin function is the function whose optimum inputs are being sought. The R package GA is used to find the optimum. Gratifyingly, hardly any code is needed; it is on GitHub here.
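For flavour, here is a condensed sketch of the optimisation part, close to the GA package's own examples; the real code, with the Shiny wrapper, is the GitHub version linked above.

  library(GA)

  # The 2-d Rastrigin function: many local optima, global minimum at (0, 0)
  rastrigin <- function(x1, x2) {
    20 + x1^2 + x2^2 - 10 * (cos(2 * pi * x1) + cos(2 * pi * x2))
  }

  # ga() maximises, so the fitness is the negated function
  result <- ga(
    type    = "real-valued",
    fitness = function(x) -rastrigin(x[1], x[2]),
    lower   = c(-5.12, -5.12),
    upper   = c(5.12, 5.12),
    maxiter = 50
  )

  summary(result)
  plot(result)   # best and mean fitness per generation
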

The progress of the genetic algorithm towards the goal can be plotted. Here's an example.
As with RapidMiner, convergence towards the optimum is quick, and it usually happens in about 30 generations.

In summary, of course R can do this, but for me the cool part is how concise the code is and, thanks to Shiny, how easy it is to show to others.

Wednesday 17 January 2018

Populating SQL Server from Apache NiFi

Apache NiFi continues to amaze with its capabilities.

The only issue is the documentation, which is sometimes slightly impenetrable and doesn't always join the dots. A good example is the processor called ConvertJSONToSQL, which converts simple JSON into an SQL INSERT statement so that a relational database can be populated with the JSON data. To make this work, there are a number of other things that need to be got right first. In the interests of giving something back, I'll describe below what I had to do to make it all work.

The first thing to do is make sure the JSON you have has a simple structure. By simple, I mean that each JSON document contains only name-value pairs and there are no nested fields. This makes sense when you consider that you are trying to populate a single table with named values, so there needs to be a one-to-one correspondence between the JSON fields and the SQL table columns.

Next, make sure you have actually created the SQL table, and remember its name as well as the schema and database you created it under. Make sure the names of the columns in your database match the names of the JSON fields. This last point is important.
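As a made-up illustration (none of these names come from my actual flow), a suitably flat JSON document and a matching SQL Server table might look like this:

  {
    "id": 42,
    "name": "widget",
    "price": 9.99
  }

  CREATE TABLE dbo.products (
    id    INT,
    name  VARCHAR(100),
    price DECIMAL(10, 2)
  );

The column names line up exactly with the JSON field names, which matters because Translate Field Names will be set to false later on.
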

You must also download the correct JDBC driver from Microsoft. The download location changes often, so the best approach is to search for "Microsoft JDBC Driver 4.2 for SQL Server". You will eventually find a file called something like "sqljdbc_6.2.2.0_enu.tar.gz". Buried in this archive will be another file called "sqljdbc42.jar". This is the one you want, and you must place it in a location that can be seen by NiFi. I happen to be running NiFi in a Docker environment and, for simplicity's sake, I put the file in "/tmp". Obviously, your setup might be different and you may have to go off-piste here. Note that more modern versions of the JDBC driver are available; if you want, you can download and use those instead.

Now you must create a thing called a DBCPConnectionPool. This defines where your database is and how to access it. The ConvertJSONToSQL processor refers to this. The easiest way to create this is to edit the JDBC Connection Pool configuration of the ConvertJSONToSQL processor and follow the dialog to create a new connection pool. You'll know you are in the right place when you are editing a controller service. The critical parameters for this are

  • The Database Connection URL: set this to something like jdbc:sqlserver://yourserveraddress
  • The Database Driver Class Name: set this to com.microsoft.sqlserver.jdbc.SQLServerDriver
  • The Database Driver Location(s): set this to the location where you have saved the sqljdbc driver file
  • The Database User: set this to a suitable user
  • Password: set this to the correct password (note that the password is not saved by default when saving templates, so you have to re-enter it if you are importing a template from elsewhere; this behaviour can be changed, but that's a subject for another day)
Once configured, the controller must be enabled by clicking on the little lightning icon. If the state of the service goes to Enabled, all is well. If it sticks at Enabling then it usually means the driver cannot be found.

Now we can get on with creating the SQL from JSON. Connect the flow containing the JSON to the ConvertJSONToSQL processor. The critical parameters are
  • JDBC Connection Pool: Set this to the connection pool created above
  • Statement Type: set this to INSERT
  • Table Name: set this to the name of the table you created above
  • Catalog Name: set this to the name of the database that you created the table in
  • Schema Name: set this to the name of the schema under which you created the table
  • Translate Field Names: set this to false
The other fields are more complex and depend on what fields you have and what you want to do if they are not present in the source or not present in the target.

Now you can run the ConvertJSONToSQL processor and you should see output on the SQL relationship. If you examine the flow files that are created, you should see SQL INSERT statements. If you do, all is well. At first sight, however, the contents of the flow files seem to be missing data values. Fear not: the values are filled in from the attributes associated with each flow file. This means the input JSON is split into multiple SQL flow files, one for each JSON document. This might look like a performance problem, but it works out later when the PutSQL processor runs, because that processor can batch up multiple SQL inserts. If you get errors at this point, it is often because field names are mismatched. Be aware that the processor queries the database directly to get the field names in order to perform this checking.
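To give a flavour of what to expect, using the made-up names from the earlier example, each outgoing flow file carries a parameterised statement roughly of this form, with the actual values held in sql.args.N.type and sql.args.N.value attributes (the type numbers are java.sql.Types codes):

  INSERT INTO products (id, name, price) VALUES (?, ?, ?)

  sql.args.1.type  = 4        (INTEGER)
  sql.args.1.value = 42
  sql.args.2.type  = 12       (VARCHAR)
  sql.args.2.value = widget
  sql.args.3.type  = 3        (DECIMAL)
  sql.args.3.value = 9.99
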

The final step is to use the PutSQL processor and pass the SQL flow files to it. The JDBC Connection Pool parameter for this processor must be set to the connection pool created above. Run this processor and you should see data being inserted into your database.

So, in summary, it's quite an involved process but I hope this has helped you get there a bit more quickly.