
Saturday 29 January 2011

Creating test data with attributes that match the training data

When applying a model to a test set, it is (usually) important that the number and names of attributes match those used to create the model.

When extracting features from data where the attribute names depend on the data itself, the test data will often lack some of the attributes used by the model and may contain additional ones.

The example here shows the following (a sketch of the underlying logic appears after the list):
  1. Training data with attributes att2 to att10 and a label
  2. Test data with attributes att1 and att2
  3. Attribute att1 is removed from the test data by using weights from the training data
  4. Attributes att2 to att10 and the label are added to the test data by using a join operator
  5. The resulting test data contains attributes att2 to att10 and a label; only att2 has values, all the others are missing
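
As a sketch of the underlying set logic (the process itself does this with attribute weights and a Join operator; the attribute lists here are just an illustration):

    // Which attributes to remove from, and add to, the test data so
    // that it matches the training data.
    def trainAttrs = (2..10).collect { "att" + it }   // att2 .. att10
    def testAttrs  = ["att1", "att2"]

    def toRemove = testAttrs - trainAttrs   // ["att1"]: not in the model
    def toAdd    = trainAttrs - testAttrs   // att3 .. att10: add with missing values

    println "Remove: $toRemove"
    println "Add as missing: $toAdd"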

Sunday 16 January 2011

How regression performance varies with noise

Here's a table showing how the various performance measures from the regression performance operator change as more and more noise is added to a regression problem (it's the process referenced in another post).

This is not a feature of RapidMiner as such; it's simply a quick reference showing the limits of these performance measures under noise-free and noisy conditions, so that when they are seen for a real problem, the table can help gauge how good the model is.
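
As a rough illustration of the idea (not a reproduction of the table), here is a minimal Groovy sketch. It assumes "noise" means Gaussian noise added to the label, and uses the correlation between the true and noisy labels as a stand-in for model performance; the numbers and noise levels are made up.

    def rng = new Random(2001)
    def y = (1..100).collect { it as double }

    // Pearson correlation between two equal-length lists.
    def correlation = { List a, List b ->
        def ma = a.sum() / a.size(), mb = b.sum() / b.size()
        def cov = [a, b].transpose().collect { (it[0] - ma) * (it[1] - mb) }.sum()
        def sa = Math.sqrt(a.collect { (it - ma) ** 2 }.sum())
        def sb = Math.sqrt(b.collect { (it - mb) ** 2 }.sum())
        cov / (sa * sb)
    }

    // More noise, lower correlation.
    [0.0, 10.0, 50.0, 200.0].each { sd ->
        def noisy = y.collect { it + rng.nextGaussian() * sd }
        printf("noise sd %6.1f -> correlation %.3f%n", sd, correlation(y, noisy))
    }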

Be careful using example sets inside loops

If you have an example set that is used inside a loop, it's important to remember that any changes made to attributes will be retained between iterations. To get round this, use the Materialize Data operator, which creates a new copy of the data at the expense of increased memory usage.
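
The pitfall is easy to reproduce outside RapidMiner. Here is a minimal Groovy sketch, assuming the loop adds Gaussian noise to the same data on each iteration; the values and noise level are made up.

    def rng = new Random(2001)
    def addNoise = { List xs, double sd -> xs.collect { it + rng.nextGaussian() * sd } }

    def original = [1.0, 2.0, 3.0, 4.0]

    // Without materializing: the noisy result is carried into the next
    // iteration, so noise is added on top of noise and accumulates.
    def carried = original
    3.times { carried = addNoise(carried, 0.1) }

    // With materializing: each iteration starts from a fresh copy of
    // the original data, which is what Materialize Data achieves.
    def fresh
    3.times { fresh = addNoise(new ArrayList(original), 0.1) }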

Here's an example that shows more and more noise being added to an example set before doing a linear regression.

If you plot noise against correlation and set the colour of the data points to be the attribute "before", you will see that when before = 2 (corresponding to no materialize operation) the correlation drops more quickly than it should as noise increases. The noise-correlation curve is more reasonable when before = 1 (corresponding to enabling the materialize operation). The inference is that noise is being added on top of noise from a previous iteration, making the correlation poorer than expected.


In the example, the order in which the parameters are set is also important. If you get this wrong, the parameters will be reset at the wrong time relative to the materialize operator, which causes incorrect results.

Sunday 9 January 2011

What does X-Validation do?

The model from X-Validation, applied to all the data, will produce a performance that does not match the estimate given by the same operator.

This is correct behaviour but can be confusing.

Following on from this thread on the RapidMiner forum, my explanation is as follows.

A 10 fold cross validation on the Iris data set using Decision Trees produces a performance estimate of 93.33% +/- 5.16%. If you use the model produced by this operator on the whole dataset you get a performance of 94.67% (note: in fact, this model is the same as one produced by using the whole dataset).

Which performance number should be believed and why is there a difference?

Imagine someone were to produce a new Iris data set. If we use the complete model to predict the new data, what performance would we get? Would 94.67% or 93.33% be more realistic? There is no right answer, but the 93.33% figure is probably more realistic because this is what the performance estimate is providing. The whole point of the X-Validation operator is to estimate what the performance would be on unseen data; outputting a model is not its main aim.

It works by splitting the data into 10 partitions (this is the default). For each partition, it builds a model on the other nine (90% of the original data) and applies it to the held-out 10% to get a performance. It does this for all 10 partitions to get an average.
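
As a sketch of this partitioning (assuming 150 examples, as in the Iris data, and the default 10 folds; this illustrates the idea and is not RapidMiner's internal code):

    int n = 150, k = 10
    def indices = (0..<n).toList()
    Collections.shuffle(indices, new Random(2001))

    // Ten disjoint folds; each example is held out exactly once.
    int foldSize = n.intdiv(k)
    def folds = (0..<k).collect { f -> indices.subList(f * foldSize, (f + 1) * foldSize) }

    folds.each { test ->
        def train = indices - test   // the other 90% of the examples
        // build a model on train, apply it to test, record the performance
    }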

To illustrate with an example, the following 10 performance values are produced inside the operator (I'm using accuracy from the Performance Classification operator and the random number seed for the whole process is 2001).


0.867
0.933
0.933
1.0
0.933
1.0
0.933
1.0
0.867
0.867

The arithmetic mean of these is 0.933 and this matches the performance output by the X-Validation operator, namely 93.33%. The stdevp of these values is 0.051511; this matches (allowing for the rounding of the individual values) the 5.16% error output by the operator. This means the X-Validation operator averages the performance over all the iterations inside the operator (note: if the performance result shows an error estimate, you know that some averaging has been done).
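
This averaging is easy to reproduce. Here is a minimal Groovy sketch using the ten values above:

    // Reproduce the operator's averaging from the ten fold accuracies.
    def acc = [0.867, 0.933, 0.933, 1.0, 0.933, 1.0, 0.933, 1.0, 0.867, 0.867]

    def mean = acc.sum() / acc.size()
    def stdevp = Math.sqrt(acc.collect { (it - mean) ** 2 }.sum() / acc.size())

    printf("%.4f +/- %.6f%n", mean, stdevp)   // 0.9333 +/- 0.051511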

Each of the 10 models may be different. This is inevitable because the data in each partition is different. For example, manual inspection of the models produced inside the operator leads to differences as shown below.



The final model happens to be the same as the first of these, but there is no reason to suppose that it would be in general. In practice the models may often turn out the same, but it is worth bearing in mind that they need not be.

Nonetheless, the end result is likely to represent the performance of a model built using known data and applied to unknown data. Simply building a model on all the data and then applying it to that same data may give a higher performance because of overfitting; this latter figure is less trustworthy because no unseen data has been used to make any of the models.

Tuesday 4 January 2011

Groovy script prompts for user input

The "How to Extend RapidMiner" guide gives an example of how to write a Groovy script. I paid my €40, and I have created example scripts to show for it.

Here's a really simple example that enhances the Fast Fourier Transform process here, allowing it to prompt the user for two values that are then used by the process. It's a bit quicker than editing the process each time.
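
For flavour, here is a minimal sketch of this kind of prompt. The prompt text and macro names are my own invention, and the hand-off to the process via the macro handler is an assumption rather than a copy of the downloadable script:

    import javax.swing.JOptionPane

    // Prompt the user for the two values (hypothetical prompts).
    String lower = JOptionPane.showInputDialog("Lower value?")
    String upper = JOptionPane.showInputDialog("Upper value?")

    // Assumed hand-off: store them as process macros so that later
    // operators can refer to them as %{lowerValue} and %{upperValue}.
    operator.getProcess().getMacroHandler().addMacro("lowerValue", lower)
    operator.getProcess().getMacroHandler().addMacro("upperValue", upper)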

Get the new process here

Monday 3 January 2011

K-distances plots

The operator "Data to Similarity" calculates how similar each example is to each other example within an example set. Here's an example using 5 examples with euclidean distance as the measure.

Firstly, the examples...


Now the distances...


A k-distance plot displays, for a given value of k, the distances from all points to their kth nearest point. These are sorted and plotted.

For k = 2, which is equivalent to the nearest neighbour (each point counts as its own first nearest point), the nearest distances for each id are
  1. 0.014
  2. 0.014
  3. 0.177
  4. 0.378
  5. 0.400
(edit: I made a mistake previously, items 3, 4 and 5 were wrong)

The plot looks like this


The values are sorted in descending order, so the smallest value is to the right rather than starting at the left near the origin.
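
To make the calculation concrete, here is a minimal Groovy sketch, assuming two-dimensional examples and the convention above that a point is its own first nearest point (so k = 2 picks out the nearest other point). The points are made up for illustration; they are not the example set shown above.

    // Pairwise Euclidean distances, as "Data to Similarity" would produce.
    def points = [[0.1, 0.2], [0.11, 0.21], [0.3, 0.3], [0.6, 0.5], [0.9, 0.9]]

    def dist = { a, b -> Math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) }

    int k = 2
    def kDistances = points.collect { p ->
        points.collect { q -> dist(p, q) }.sort()[k - 1]   // kth smallest
    }.sort().reverse()   // descending: the smallest value ends up on the right

    println kDistances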

These plots can be used to choose values for the epsilon parameter in the DBScan clustering operator.

Some more notes about this to follow...