Sunday, 28 July 2013

Aggregating attributes with parentheses

I stumbled upon a feature of the Aggregate operator just now that took me far too long to understand; I should have known better. In the spirit of altruism, I hope the following post will save others a bit of time.

It's well known that if attribute names contain parentheses or certain other mathematical symbols, the "Generate Attributes" operator will have problems. Users can get frustrated by this, but it's easy to work around simply by renaming the attributes. I believe changing this behaviour in RapidMiner would be extremely disruptive for backward compatibility, so we have to live with it.

I discovered that the "Aggregate" operator behaves similarly. The following illustrative process builds a model on the Iris data set and then applies it to the original data (purists will wince at the overfitting). The process then aggregates by the attributes "label" and "prediction(label)" and counts the number of examples for each combination. The process also aggregates using a renamed attribute without the parentheses. I have selected "count all combinations", so with three label values and three prediction values I am expecting to see 9 rows in the output.
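To make the expectation of 9 rows concrete, here is a minimal Python sketch of what counting all combinations means. It is not RapidMiner code; the function name `count_all_combinations` and the four sample rows are made up for illustration and are not the real Iris results.

```python
from collections import Counter
from itertools import product


def count_all_combinations(rows, group_by):
    """Count rows per combination of group_by values, keeping empty combinations."""
    # Distinct values seen for each grouping attribute, in first-seen order.
    values = {name: list(dict.fromkeys(row[name] for row in rows)) for name in group_by}
    # Number of rows observed for each combination that actually occurs.
    counts = Counter(tuple(row[name] for name in group_by) for row in rows)
    # Walk the full cross product so combinations with no rows get a count of 0.
    return {combo: counts[combo] for combo in product(*(values[name] for name in group_by))}


# Made-up rows for illustration only (hypothetical predictions).
rows = [
    {"label": "Iris-setosa", "prediction": "Iris-setosa"},
    {"label": "Iris-versicolor", "prediction": "Iris-versicolor"},
    {"label": "Iris-versicolor", "prediction": "Iris-virginica"},
    {"label": "Iris-virginica", "prediction": "Iris-virginica"},
]

result = count_all_combinations(rows, ["label", "prediction"])
print(len(result))  # 3 label values x 3 prediction values
```

With both grouping attributes present there is one row per pair of values, including pairs that never occur in the data. If one grouping attribute is silently dropped, the row count collapses to the number of distinct values of the other.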

The first output looks like this.

Notice how the "prediction(label)" attribute is missing.

The second output looks like this.

Now we see all 9 expected rows (and continue to wince at the overfitting).

Unfortunately, there is no warning message about the missing attribute in the first case. This probably explains why it took me a while to understand what the problem was. Arguably this could be a bug, but I subscribe to the view that the only issues that matter are the ones you don't know about. We know about this one, so we can work around it.

As an aside, I have written a little Groovy script that bulk renames attributes to a standard form and, crucially, outputs a second example set containing the mapping, which can be stored so the renaming can be reversed later. It's a bit rough and ready, and I don't have the time to polish it enough to feel good about posting it.
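For readers who want the general idea, here is a rough Python sketch of reversible renaming. It is not the Groovy script mentioned above and does not touch RapidMiner; the function names `standardise_names` and `restore_names`, the replacement rule (anything other than letters, digits and underscore becomes an underscore) and the sample attribute names are all assumptions for illustration.

```python
import re


def standardise_names(names):
    """Return (new_names, mapping), where mapping maps each new name to its original."""
    mapping = {}
    new_names = []
    for name in names:
        # Assumed rule: collapse every run of other characters into one underscore.
        candidate = re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_") or "attribute"
        # Add a numeric suffix if two originals would end up with the same new name.
        unique = candidate
        suffix = 2
        while unique in mapping:
            unique = f"{candidate}_{suffix}"
            suffix += 1
        mapping[unique] = name
        new_names.append(unique)
    return new_names, mapping


def restore_names(names, mapping):
    """Reverse the renaming; names that are not in the mapping are left untouched."""
    return [mapping.get(name, name) for name in names]


# Sample attribute names chosen for illustration.
original = ["label", "prediction(label)", "confidence(Iris-setosa)"]
renamed, mapping = standardise_names(original)
print(renamed)
print(restore_names(renamed, mapping) == original)
```

Storing the mapping alongside the renamed data is what makes the round trip possible, because the cleaned names alone do not say where the parentheses used to be.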
