Saturday 7 November 2009

Feature Selection

The process of obtaining the attributes that characterise an example in an example set can be time-consuming. For instance, analysing a music sample using various value series techniques can take many minutes. Anything that reduces the number of attributes without reducing the predictive power of a model is worth pursuing.

There is another reason why reducing the number of attributes is a good idea: overfitting. This happens when a model has too many attributes to consider, so it focuses on the detail and misses the big picture, and its ability to predict new data suffers. Overfitting is an important problem to understand and detect, but it is a topic for another post.

Rapidminer has an operator called FeatureSelection that examines the performance of a model whilst varying which attributes in the example set contribute to it, ensuring that the accuracy of the model is not compromised.

In version 4.6 of Rapidminer, the online tutorial has an example in 10_ForwardSelection. This shows the FeatureSelection operator placed around the XValidation operator. The XValidation operator measures the performance of the model against multiple training and test sets extracted from the example set, and the FeatureSelection operator invokes it many times, each time with a slightly different subset of attributes. Eventually it settles once it has found the minimum number of attributes required to maintain the model's predictive accuracy.
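
To make the idea concrete, here is a minimal sketch (in Python, not Rapidminer's own code) of wrapper-style forward selection: a greedy loop that repeatedly runs cross validation on candidate attribute subsets, much as FeatureSelection repeatedly invokes XValidation. The use of scikit-learn and a decision tree learner here is purely an illustrative assumption.

# Illustrative sketch of forward feature selection wrapped around
# cross validation. Not Rapidminer code; scikit-learn is used only
# to supply the inner accuracy estimate.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, folds=10):
    remaining = list(range(X.shape[1]))  # candidate attribute indices
    selected = []
    best_score = 0.0
    while remaining:
        # Try adding each remaining attribute in turn, running a full
        # cross validation for every candidate subset, just as
        # FeatureSelection invokes XValidation repeatedly.
        scores = {
            f: cross_val_score(DecisionTreeClassifier(),
                               X[:, selected + [f]], y, cv=folds).mean()
            for f in remaining
        }
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break  # no attribute improves accuracy any further: stop
        selected.append(f)
        remaining.remove(f)
        best_score = score
    # Weight table: 1 for retained attributes, 0 for the rest
    weights = np.zeros(X.shape[1])
    weights[selected] = 1.0
    return weights, best_score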

The operator produces a table of attribute weights to indicate which attributes should be retained and which can be ignored. The weights vary between 0 and 1; a value of 0 means the attribute can be ignored.
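
Continuing the sketch above, the weight table can be used to discard the zero-weighted attributes before building the final model. The synthetic data below, in which only the first three attributes carry any signal, is an assumption made purely to mirror the shape of the tutorial's result.

# Hypothetical data: six attributes, of which only the first three matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

weights, acc = forward_selection(X, y)
X_reduced = X[:, weights > 0]  # keep only attributes with non-zero weight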

In the tutorial example, the process correctly identifies that attributes 1, 2 and 3 are relevant for the model.