Data Science With RapidMiner: November 2009

Sunday, 29 November 2009

StringTextInput pruning

The manual is not clear what the parameters prune_below and prune_above mean.

They are a way to remove very common or very rare attributes from an example set. If you set prune_below to 40% and prune_above to 10% you will remove about 70% of the example set.

I have no idea how it calculates these - over to the code...

Saturday, 14 November 2009

MultivariateSeries2WindowExamples

At its most basic, this operator converts an example set containing a single attribute into an example set containing multiple attributes. The values across a single row of the new example set appear in the same order as the values of the single attribute down consecutive rows of the original example set.

An example helps as always; this uses the following process.

<operator name="Root" class="Process" expanded="yes">
<operator name="ExampleSetGenerator" class="ExampleSetGenerator">
<parameter key="target_function" value="sum"/>
<parameter key="number_of_attributes" value="1"/>
</operator>
<operator name="AttributeFilter" class="AttributeFilter">
<parameter key="condition_class" value="attribute_name_filter"/>
<parameter key="parameter_string" value="label"/>
<parameter key="invert_filter" value="true"/>
<parameter key="apply_on_special" value="true"/>
</operator>
<operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples">
<parameter key="window_size" value="10"/>
</operator>
</operator>

This is a picture of the example set before loading into the MultivariateSeries2WindowExamples operator.

This is the picture of the example set after.

Notice how the value of att1 for id 1 in the first example set is 2.468 and this becomes att1-9 in the first row of the second example set. The value 7.267 for id 2 becomes att1-8 and so on.

This example uses the parameter value "encode_series_by_examples". This means the operator looks down the consecutive examples to produce the horizontal attribute values.

If the parameter is set to "encode_series_by_attributes" the operator looks horizontally along the attributes of the first set to produce the output example set.
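The "encode_series_by_examples" behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the operator's actual code: the function name window_examples is made up, the window is assumed to slide forward one example at a time, and the column naming simply copies the att1-9, att1-8 pattern seen in the post. Only the first two values in the sample series come from the post; the rest are invented.

```python
def window_examples(values, window_size):
    """Turn one series of values into rows of `window_size` consecutive values.

    ASSUMPTION: the window moves forward one example at a time. The oldest
    value in a window is named "att1-<window_size - 1>" and the newest "att1-0".
    """
    rows = []
    for start in range(len(values) - window_size + 1):
        window = values[start:start + window_size]
        row = {}
        for offset, value in enumerate(window):
            row[f"att1-{window_size - 1 - offset}"] = value
        rows.append(row)
    return rows


# 2.468 and 7.267 are from the post; the other values are invented.
series = [2.468, 7.267, 1.0, 4.0, 9.0]
for row in window_examples(series, window_size=3):
    print(row)
```

With a window size of 3, the value 2.468 lands in att1-2 of the first row and 7.267 in att1-1, which mirrors the att1-9 and att1-8 placement seen with a window size of 10.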

Friday, 13 November 2009

Making AgglomerativeClustering work

Make sure you supply a label and an id, otherwise you will get the frustrating error of the form "there's something wrong but it's not obvious".

Sunday, 8 November 2009

Generate Data - some detail

Edit Jan 2014: Some time ago this operator changed its name to Generate Data from ExampleSet Generator.

This operator has a number of parameters and the documentation doesn't quite give enough information (I'll update these when time and motivation permit).

Here is a list (taken from the source code) of how the label is calculated from the attributes for some of the values of the target_function parameter.

random: label = random
sum: label = att1 + att2 + att3 + ... + attn
polynomial: label = att1*att1*att1 + att2*att2 + att3
non linear: label = att1*att2*att3 + att1*att2 + att2*att2
one variable non linear: label = 3*att1*att1*att1 - att1*att1 + 1000 / abs(att1) + 2000*abs(att1)
complicated function: label = att1*att1*att2 - att1*att2 + max(att1,att2) - exp(att3)
complicated function2: label = att1*att1*att1 + att2*att2 + att1*att2 + att1/att2 - 1/(att3*att3)
simple sinus: label = sin(att1)
sinus: label = sin(att1*att2) + sin(att1+att2)
simple superposition: label = 5*sin(att1) + sin(30*att1)
sinus frequency: label = 10*sin(3*att1) + 12*sin(7*att1) + 11*sin(5*att2) + 9*sin(10*att2) + 10*sin(8*(att1 + att2))
sinus with trend
sinc
triangular function
square pulse function
random classification: label = "positive" or "negative" randomly chosen
one third classification: label = "positive" if att1 is greater than 0.333333333333 otherwise "negative"
sum classification: label = "positive" if the sum of all the attributes is greater than 0 otherwise "negative"
quadratic classification: label = "positive" if attribute2 > attribute1^2 otherwise "negative"
simple non linear classification: label = "positive" if attribute1*attribute2 is between 50 and 80 otherwise "negative"
interaction classification: label = "positive" if att1 < 0 or (att2 > 0 and att3 < 0) otherwise "negative"
simple polynomial classification: label = "positive" if att1^4 > 100 otherwise "negative"
polynomial classification: label = "positive" if att0^3 + att1^2 - att2^2 + att3 > 0 otherwise "negative"
checkerboard classification
random dots classification
global and local models classification
sinus classification
multi classification: round the sum of the attributes and take an absolute integer value, if the result is divisible by 2 the label becomes "one", if divisible by 3 (but not 2) the label becomes "two", if divisible by 5 (but not 2 or 3) the label becomes "three", otherwise "four"
two gaussians classification
transactions dataset
grid function
three ring clusters
spiral cluster
single gaussian cluster
gaussian mixture clusters: generates clusters in an N dimensional space where N is the number of attributes. The number of clusters is 2^N (so take care, 20 attributes leads to more than a million clusters which slows RapidMiner down somewhat)
driller oscillation timeseries
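A few of the formulas in the list above, written out as Python for anyone who wants to check values by hand. This is a sketch based only on the descriptions in the list, not a copy of the RapidMiner source. The function names are made up, attributes are passed as a plain list with att1 first, and the rounding in the multi classification case uses Python's round(), which is an assumption about how the original rounds.

```python
import math


def label_sum(atts):
    # sum: label = att1 + att2 + ... + attn
    return sum(atts)


def label_polynomial(atts):
    # polynomial: label = att1*att1*att1 + att2*att2 + att3
    att1, att2, att3 = atts[:3]
    return att1 ** 3 + att2 ** 2 + att3


def label_simple_superposition(atts):
    # simple superposition: label = 5*sin(att1) + sin(30*att1)
    return 5 * math.sin(atts[0]) + math.sin(30 * atts[0])


def label_one_third_classification(atts):
    return "positive" if atts[0] > 0.333333333333 else "negative"


def label_multi_classification(atts):
    # ASSUMPTION: Python's round() stands in for the rounding used in the source.
    n = abs(int(round(sum(atts))))
    if n % 2 == 0:
        return "one"
    if n % 3 == 0:
        return "two"
    if n % 5 == 0:
        return "three"
    return "four"


print(label_polynomial([2, 3, 4]))             # 8 + 9 + 4 = 21
print(label_multi_classification([1.2, 1.9]))  # sum rounds to 3, so "two"
print(2 ** 20)                                 # clusters for 20 attributes: 1048576
```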

Saturday, 7 November 2009

Feature Selection

The process of obtaining the attributes that characterise an example in an example set can be time consuming. As an example, analysing a music sample using various value series techniques can take many minutes. Anything that allows the number of attributes to be reduced without reducing the predictive power of a model is worth pursuing.

There is another reason why reducing the number of attributes is a good idea: overfitting. This happens when a model has too many attributes to consider and so tends to focus on the detail and miss the big picture. The ability of the model to predict new data is reduced. Overfitting is an important problem to understand and detect, and is a topic for another post.

RapidMiner has an operator called FeatureSelection that examines the performance of a model whilst varying the contribution from the attributes in the example set and ensuring that the accuracy of the model is not compromised.

In version 4.6 of RapidMiner, the online tutorial has an example in 10_ForwardSelection. This example shows how the FeatureSelection operator is placed around the XValidation operator. The XValidation operator measures the performance of the model against multiple training and test sets extracted from the example set. The FeatureSelection operator invokes the XValidation operator multiple times with slightly different attribute contributions. Eventually, it settles down once it has found the minimum number of attributes required to maintain model predictive accuracy.

The operator produces an attributes weight table to indicate which attributes should be retained and which can be ignored. The weights vary between 0 and 1. The value 0 means that the attribute can be ignored.

In the tutorial example, the process correctly identifies that attributes 1, 2 and 3 are relevant for the model.
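The idea of wrapping a selection loop around a performance measurement can be sketched in Python. This is a minimal greedy forward selection, assuming that is roughly what the tutorial process does; it is not the FeatureSelection operator's code. The names forward_selection and toy_score are made up, and toy_score is an invented stand-in for the score that XValidation would return.

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection over attribute names.

    `evaluate` takes a list of attribute names and returns a score (higher is
    better). Returns a weight table: 1.0 means keep, 0.0 means ignore.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(attributes)
    while remaining:
        # Score every way of adding one more attribute to the current subset.
        scored = [(evaluate(selected + [att]), att) for att in remaining]
        score, att = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(att)
        remaining.remove(att)
        best_score = score
    return {att: (1.0 if att in selected else 0.0) for att in attributes}


# Invented ground truth for the toy score; a real run would use cross validation.
RELEVANT = {"att1", "att2", "att3"}


def toy_score(subset):
    hits = sum(1 for att in subset if att in RELEVANT)
    noise = sum(1 for att in subset if att not in RELEVANT)
    return hits - 0.1 * noise


weights = forward_selection(["att1", "att2", "att3", "att4", "att5"], toy_score)
print(weights)
```

In this toy setup the weight table ends up with 1.0 for att1, att2 and att3 and 0.0 for the rest, which is the same shape of result the tutorial example produces.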

Stacking

With version 4.6 of RapidMiner the tutorial entitled 19_Stacking demonstrates stacking. This is a way to use the results of one type of learning model to improve a later model.

In the example there are 4 learning models. In order these are naive bayes, decision tree, nearest neighbours and linear regression.

The stacking operator applies the decision tree operator first and adds a new attribute to the example set. This attribute corresponds to the prediction made by the decision tree learner. Next, it applies the nearest neighbours model and adds another new attribute. The linear regression model is applied next to add a third new prediction.

At this point, the example set contains the original label, the original attributes as well as three new attributes for the three predictions. The naive bayes model now makes its prediction. This prediction should be better because of the other models' predictions.

The example in the tutorial doesn't show this.

If you want to see this, do the following. Put an XValidation operator around the stacking operator, add an operator chain after this and then add a model applier operator and a performance operator. This is best explained with a picture.

This is a standard way to perform cross validation; I have a more detailed explanation here.

If you run this you will get what is known as a confusion matrix. This shows how good the model is at classifying, based on a comparison against known classifications. Here's another picture to show this.

In this case, the number of incorrect classifications is very small.
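For readers without the picture, a confusion matrix is just a count of each (known label, predicted label) pair. Here is a small Python sketch with invented labels; the function names are made up and this is not how RapidMiner's performance operator is implemented.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of examples where the prediction matches the known label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Invented labels, purely for illustration.
actual = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "negative", "positive"]

matrix = confusion_matrix(actual, predicted)
for pair in sorted(matrix):
    print(pair, matrix[pair])
print(accuracy(actual, predicted))
```

The pairs where the two labels differ are the incorrect classifications; here there is one of them out of five examples.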

Now we can try to change the original stacking operator to see what effect each of the learning models has on the results. For example, disable the decision trees operator and run the process again. This time the result is much poorer. Here's an example.

Now it's possible to try different operators to see what effect each has. There is an advanced feature of RapidMiner that allows a search to be done automatically but that's a subject for another day.
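To make the stacking mechanism described earlier in this post concrete, here is a toy Python sketch. It is not the stacking operator's code. The two threshold learners and the nearest neighbour learner are invented stand-ins for the tutorial's decision tree, nearest neighbours, linear regression and naive bayes models, and all function names are made up. The only part taken from the post is the flow: each base model adds its prediction as a new attribute, and the final model learns from the original attributes plus those predictions.

```python
def threshold_learner(index):
    """Toy base learner: split one attribute at its training mean."""
    def fit(rows, labels):
        cut = sum(row[index] for row in rows) / len(rows)
        # Flip the rule if "above the cut means 1" matches fewer than half the labels.
        agree = sum((row[index] > cut) == bool(y) for row, y in zip(rows, labels))
        flip = agree < len(rows) / 2
        return lambda row: int((row[index] > cut) != flip)
    return fit


def nearest_neighbour_learner(rows, labels):
    """Toy final learner: return the label of the closest training row."""
    def predict(row):
        distances = [sum((a - b) ** 2 for a, b in zip(row, other)) for other in rows]
        return labels[distances.index(min(distances))]
    return predict


def fit_stacking(rows, labels, base_learners, final_learner):
    base_models = [fit(rows, labels) for fit in base_learners]

    def augment(row):
        # Original attributes plus one new attribute per base model prediction.
        return list(row) + [model(row) for model in base_models]

    final_model = final_learner([augment(row) for row in rows], labels)
    return lambda row: final_model(augment(row))


# Invented data: the label is 1 only when both attributes are 1.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
model = fit_stacking(rows, labels,
                     [threshold_learner(0), threshold_learner(1)],
                     nearest_neighbour_learner)
print([model(row) for row in rows])
```

Removing one of the base learners from the list is the equivalent of disabling an operator inside the stacking operator, so the same kind of experiment can be tried on the toy version.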

Thursday, 5 November 2009

What is RapidMiner?

RapidMiner is very fine open source software for data mining. Its previous name was YALE.

A couple of years ago, the product was renamed "RapidMiner" with a corresponding holding company name of "Rapid-I". The open source software is still available but there is also a commercial version with support, training and consultancy available. Go here to see it all.

RapidMiner is an "academic grade" meta-product; this is both good and bad. It's good for flexible, leading-edge experimentation, but for mere mortals it is fantastically hard to learn and fantastically hard to use.

For better or worse, I am learning RapidMiner and it helps me to learn if I write things down and try things out. Using a blog forces me to structure properly since I have to pretend I am addressing the content to someone else.
