
Monday, 19 December 2011

Operators that deserve to be better known: part I

There are many operators in RapidMiner. Some are rarely mentioned but deserve to be more widely known because they can do something that would otherwise need a lot of dangerous and exhausting gymnastics (and as a side note, the word gymnastics comes from a Greek word for naked; you have been warned).

Two such operators are "Fill Data Gaps" and "Replace Missing Values (Series)".

The first of these examines the id attribute of an example set, arranges the ids in order and works out if any are missing (integer id attributes are important; using other types causes problems). If there are, it will do its best to create new examples to fill the gaps. An illustration is always helpful, so imagine you have 5 examples with ids 1,2,5,6,7. The "Fill Data Gaps" operator will create 2 new examples with ids of 3 and 4. Any new examples will be created with all the attributes within the example set, but of course these will be set to missing. This is where the second operator can be useful.
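To make the gap-filling idea concrete, here is a minimal sketch in plain Python. It is not how RapidMiner implements "Fill Data Gaps"; it only mimics the behaviour described above. The function name fill_id_gaps, the dictionary representation of an example and the use of None for a missing value are all assumptions made for illustration.

```python
def fill_id_gaps(examples, id_key="id"):
    """Return the examples ordered by integer id, adding a row for each missing id.

    Hypothetical helper: each example is a dict and None stands for "missing".
    """
    by_id = {row[id_key]: row for row in examples}
    if not by_id:
        return []

    # Collect every attribute name seen in the input, other than the id itself.
    attributes = []
    for row in examples:
        for name in row:
            if name != id_key and name not in attributes:
                attributes.append(name)

    filled = []
    for i in range(min(by_id), max(by_id) + 1):
        if i in by_id:
            filled.append(by_id[i])
        else:
            # New example: the id is set, every other attribute is missing.
            new_row = {id_key: i}
            new_row.update({name: None for name in attributes})
            filled.append(new_row)
    return filled


# The example from the text: ids 1,2,5,6,7 gain new examples with ids 3 and 4.
examples = [{"id": i, "att1": float(i * 10)} for i in (1, 2, 5, 6, 7)]
filled = fill_id_gaps(examples)
print([row["id"] for row in filled])
```

The sketch only handles integer ids with a step of 1, which matches the illustration above.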

The "Replace Missing Values (Series)" operator is part of the series extension so it perhaps doesn't get out as often as it should, but it's very useful. Given a load of missing values, it will try to fill them in by assuming the examples form a series. It works on individual attributes in example sets only and treats each attribute as a sequential series. As it works its way along a series, if it encounters a missing value it will try to fill it in based on its parameter settings. These settings are "previous value", "next value", "value" and "linear interpolation". For an illustration, imagine you have 7 examples with attribute att1 set to 10,11,?,?,17,19,20. Using the "previous value" setting would cause these missing values to be set to 11 and 11. Using "linear interpolation" would set them to 13 and 15.
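The four settings can be sketched in plain Python as follows. Again, this is only an illustration of the behaviour described in the paragraph above, not the operator's actual code. The function name replace_missing, the use of None for a missing value and the choice to leave a gap untouched when it has no neighbour to draw on are assumptions.

```python
def replace_missing(series, method="previous value", value=None):
    """Fill None entries in a list, treating the list as a sequential series.

    Hypothetical helper; method names follow the settings named in the post.
    """
    methods = ("previous value", "next value", "value", "linear interpolation")
    if method not in methods:
        raise ValueError("unknown method: " + method)

    result = list(series)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is not None:
            i += 1
            continue
        # Find the run of missing values occupying positions i .. j-1.
        j = i
        while j < n and result[j] is None:
            j += 1
        prev = result[i - 1] if i > 0 else None  # last known value before the gap
        nxt = result[j] if j < n else None       # first known value after the gap
        for k in range(i, j):
            if method == "previous value":
                result[k] = prev
            elif method == "next value":
                result[k] = nxt
            elif method == "value":
                result[k] = value
            elif prev is not None and nxt is not None:
                # Linear interpolation: equal steps between the two neighbours.
                step = (nxt - prev) / (j - i + 1)
                result[k] = prev + step * (k - i + 1)
        i = j
    return result


att1 = [10, 11, None, None, 17, 19, 20]
print(replace_missing(att1, "previous value"))
print(replace_missing(att1, "linear interpolation"))
```

With the att1 values from the text, the two gaps become 11 and 11 under "previous value", and 13 and 15 under "linear interpolation", because the step between 11 and 17 across three intervals is 2.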

I find myself using these operators a lot but I keep forgetting the names; hence this post.

3 comments:

  1. Hi Andrew - I'm quite new to RapidMiner and in one of my projects, I have to join a couple of financial datasets by time and fill missing values in a Series with the previous value (similar to the fill forward feature in a pandas dataframe). I came across your blog post but I still can't find the "previous value", "next value" or "interpolation" setting under the Replace Missing Values component. All I can do is replace a missing value with the minimum, maximum, average or some predefined value. Will appreciate your comment on this. Thanks!

  2. I forgot to add I'm using RapidMiner's free community edition.

  3. Hello Gauz

    There are 2 operators with names close to 'Replace Missing Values'. The one you want is called 'Replace Missing Values (Series)' and is located in the Series extension. This is not available by default and has to be installed by downloading from the RapidMiner Marketplace. Once you do this, all will become clear.

    Andrew
