Search this blog

Monday 19 December 2011

Operators that deserve to be better known: part I

There are many operators in RapidMiner. Some are rarely mentioned but deserve to be more widely known because they can do something that would otherwise need a lot of dangerous and exhausting gymnastics (and as a side note, the origin of the word gymnastics comes from a Greek word for naked; you have been warned).

Two such operators are "Fill Data Gaps" and "Replace Missing Values (Series)".

The first of these examines the id attributes of an example set, arranges them in order and works out if there are any missing (integer id attributes are important - using other types causes problems). If there are it will do its best to create new examples to fill the gaps. An illustration is always helpful so imagine you have 5 examples with ids 1,2,5,6,7. The "Fill Data Gaps" operator will create 2 new examples with ids of 3 and 4. Any new examples will be created with all the attributes within the example set but of course these will be set to missing. This is where the second operator can be useful.

The "Replace Missing Values (Series)" operator is part of the series extension so it perhaps doesn't get out as often as it should but it's very useful. Given a load of missing values, it will try and fill them in by assuming the examples form a series. It works on individual attributes in example sets only and treats each attribute as part of a sequential series. As it works its way along a series, if it encounters a missing value it will try and fill it in based on its parameter settings. These settings are "previous value", "next value", "value" and "linear interpolation". For an illustration, imagine you have 7 examples with attribute att1 set to 10,11,?,?,17,19,20. For example, using the "previous value" setting would cause these missing values to be set to 11 and 11. Using "linear interpretation" would set them to 13 and 15.

I find myself using these operators a lot but I keep forgetting the names; hence this post.