Friday, 8 March 2019

I'm giving a talk about RapidMiner

I'll be giving a talk at the Data Science Reading Meetup group on the 26th March entitled "Introduction to RapidMiner". It's intended to be a brief introduction that should help people decide if RapidMiner is right for them.

I just discovered that it's possible to refer someone else to RapidMiner, and if that person installs the product, you get 10,000 extra rows in your license up to a maximum of 50,000. 

There are 28 people going to the talk. How I wish I could have 10,000 rows for each referral I plan to send!

Saturday, 8 December 2018

Seeing how generated attributes are constructed

Sometimes, a "brute force feature generation and selection-athon" is irresistible.

I had a feeling that some data I was looking at contained hidden relationships between attributes that could have yielded improved prediction performance. My gut feel was that dividing one attribute by another, or perhaps taking the log of one and adding it to the reciprocal of another, might give a new attribute with more predictive power. How to do this without tiresome manual intervention that would have been boring, might have missed some permutations, and would have been prone to mistakes?
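
As a minimal illustration of the kind of hand-crafted combinations I had in mind (the attribute names a and b, and the data, are made up for the example), here is an R sketch:

# Illustrative only: two hand-built candidate attributes derived from
# hypothetical numeric attributes a and b
df <- data.frame(a = runif(100, min = 1, max = 10),
                 b = runif(100, min = 1, max = 10))
df$ratio     <- df$a / df$b           # one attribute divided by another
df$log_recip <- log(df$a) + 1 / df$b  # log of one plus the reciprocal of another
head(df)

Doing this by hand for every plausible combination is exactly the tedium I wanted to avoid.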

There are a number of ways of doing this in RapidMiner. One approach uses the evolutionary operators collectively known as YAGGA (Yet Another Generating Genetic Algorithm) to perform an evolutionary search. Each iteration generates new attributes by combining existing attributes using simple functions. Performance is assessed, and attributes that don't lead to an improvement are eliminated, whilst those that do are retained and allowed to generate yet more attributes. This process repeats until the desired stopping conditions have been reached.
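
RapidMiner's implementation is of course far more sophisticated, but a toy R sketch of the general shape of that search might look like this (evaluate is a hypothetical stand-in for something like cross-validated accuracy):

# Toy sketch of an evolutionary feature search: repeatedly generate a
# candidate attribute from existing ones and keep it only if it improves
# the score returned by the supplied evaluate(df, target) function
evolve_attributes <- function(df, target, evaluate, generations = 10) {
  best <- evaluate(df, target)
  for (g in seq_len(generations)) {
    # combine two randomly chosen numeric attributes with a simple function
    nums <- setdiff(names(df)[sapply(df, is.numeric)], target)
    pair <- sample(nums, 2)
    candidate <- df
    candidate[[paste0("gensym", g)]] <- candidate[[pair[1]]] / candidate[[pair[2]]]
    score <- evaluate(candidate, target)
    if (score > best) {  # retain only attributes that improve performance
      df <- candidate
      best <- score
    }
  }
  df
}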

For the masochist, there is a lower-level operator called "Generate Function Set" that allows finer control over the operation. I adopted this because I wanted to look in detail at the attributes that led to improvements, and equally at those that led nowhere.

So I made a process. But then I got stuck, because I found there was no way, in the RapidMiner Studio GUI, to see what construction had been applied to generate each new attribute. A bit of background: when RapidMiner generates new attributes, they show up with names of the form "gensymxxx". In the old days, there was a way of seeing the attribute construction from one of the viewing panes. Alas, it's not there anymore.

Luckily, there is an operator called "Write Constructions". This takes an example set and writes a file containing the construction details for each attribute. A bit laborious, but workable.

Did I find a new attribute that made an improvement? Yes I did. It was a small improvement, but enough to be interesting: the sort of thing that would take you from the middle of the leaderboard to being a contender in a Kaggle competition.

Thursday, 25 January 2018

Keras + RapidMiner + digit recognition = 97% accuracy

I've successfully created a process using RapidMiner and Keras to recognise the MNIST handwritten digits with a headline accuracy of 97% on unseen data.

You can download the process here.

It requires R and Keras to be installed - an exercise for the reader.

The main features of the process are:

  • R is used to get the MNIST data and to create a training set and a test set. I use R to attach the label to the data, making a single example set in each case rather than separate structures for the data and the labels. This is a big strength of RapidMiner: all the data and labels are in one place.
  • The data is restructured in R to change 3d tensors of shape (60000, 28, 28) into 2d tensors of shape (60000, 784). The 3d tensor represents the images, each of size 28 by 28 pixels. RapidMiner example sets are 2d tensors, and these feed straight into the Keras part of the process.
  • The Keras part of the model has the following characteristics (a standalone R sketch follows this list):
    • The input shape is (784,), which matches the number of columns in the 2d tensor.
    • The loss parameter is set to "categorical_crossentropy" and the optimizer is set to "RMSprop".
    • There are 2 layers in the Keras model. The first is "dense" with 512 units and activation set to "relu". The second is "dense" with 10 units and activation set to "softmax". The 10 here is the number of different values the label can take (the digits 0 to 9).
  • The "validation_split" parameter is set to 0.1 so that a loss is calculated on a small part of the training data. This leads to validation loss results in the output which is used to see when over-fitting is happening.
Here is a screenshot of the history from a large run (this is output from the Keras model as an example set).

The training loss (in blue) decreases systematically as the model learns the training data better and better. The loss against the validation data (in red) worsens as the number of epochs increases, and the variation between epochs suggests I should perhaps use a larger validation fraction. Nonetheless, a small number of epochs is enough to get a model that performs well on unseen data.

The Keras model does not use convolution layers (an exercise for a later post) but despite this, it performs very well. Here is the confusion matrix using 3 epochs.

This is a very good result and shows the power of deep learning. It's gratifying that RapidMiner supports it.

As time permits, a future post will look at using convolution layers to see what improvements could be achieved. I may also do some systematic experiments to check how validation loss measured during training maps to actual loss on unseen data.