Sometimes, a "brute force feature generation and selection-athon" is irresistible.
I had a feeling that some data I was looking at contained hidden relationships between attributes that could yield improved prediction performance. My gut feel was that dividing one attribute by another, or perhaps taking the log of one and adding it to the reciprocal of another, might give a new attribute with more predictive power. How could I do this without tiresome manual work that would be boring, could miss some permutations, and could introduce mistakes?
There are a number of ways of doing this in RapidMiner. One approach uses one of the iterating operators, collectively known as YAGGA, to perform an evolutionary search. Each iteration generates new attributes by combining existing attributes using simple functions. The performance is assessed and attributes that don't lead to an improvement are eliminated whilst those that do are retained to allow them to generate yet more attributes. This process repeats until the desired stopping conditions have been reached.
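To make the generate, assess, and retain loop concrete, here is a minimal Python sketch. It is not RapidMiner's YAGGA implementation: it swaps the evolutionary search for a deterministic sweep over every pair of attributes in each round, and it uses absolute Pearson correlation with the label as a stand-in for a proper performance measure. The function set, the toy data and all names are my own assumptions for illustration.

```python
import itertools
import math

def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return abs(sxy) / math.sqrt(sxx * syy)

# Assumed function set: simple binary combinations of two attributes.
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def combine(op, col_a, col_b):
    """Apply op row by row; return None if any row divides by zero."""
    try:
        return [BINARY_OPS[op](a, b) for a, b in zip(col_a, col_b)]
    except ZeroDivisionError:
        return None

def generate_and_select(columns, target, rounds=2):
    """Keep a generated attribute only if it beats the best score so far."""
    pool = dict(columns)  # name -> values; the name records the construction
    best = max(abs_corr(values, target) for values in pool.values())
    for _ in range(rounds):
        names = list(pool)  # retained attributes can breed in later rounds
        for a, b in itertools.permutations(names, 2):
            for op in BINARY_OPS:
                name = f"({a} {op} {b})"
                if name in pool:
                    continue
                values = combine(op, pool[a], pool[b])
                if values is None:
                    continue
                score = abs_corr(values, target)
                if score > best:  # improvement: retain; otherwise discard
                    pool[name] = values
                    best = score
    return pool, best

# Toy data in which the label is secretly a / b.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "c": [5.0, 3.0, 8.0, 1.0, 9.0, 2.0],
}
target = [x / y for x, y in zip(columns["a"], columns["b"])]
baseline = max(abs_corr(values, target) for values in columns.values())
pool, best = generate_and_select(columns, target)
```

Naming each generated attribute after its expression, as this sketch does, also means the construction is never lost.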
For the masochist, there is a lower-level operator called "Generate Function Set" that gives finer control over how the attributes are generated. I adopted this because I wanted to look in detail at the attributes that were leading to improvements and equally see those that led nowhere.
So I made a process. But then I got stuck, because I found no way in the RapidMiner Studio GUI to see what construction had been applied to generate the new attributes. A bit of background: when RapidMiner generates new attributes, they show up with names of the form "gensymxxx". In the old days, there was a way of seeing the attribute construction from one of the viewing panes. Alas, it's not there anymore.
Luckily, there is an operator called "Write Constructions". This takes an example set and writes the details of each attribute's construction to a file. A bit laborious but workable.
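The underlying idea, keeping a record that maps each opaque generated name back to the expression that produced it, can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the `ConstructionLog` class and the tab-separated layout are hypothetical and are not the file format that "Write Constructions" produces.

```python
import io

class ConstructionLog:
    """Hypothetical helper: hands out gensym-style names and remembers
    the expression behind each one."""

    def __init__(self):
        self._constructions = {}

    def register(self, expression):
        # Names are numbered in order of registration, starting at 1.
        name = f"gensym{len(self._constructions) + 1}"
        self._constructions[name] = expression
        return name

    def write(self, handle):
        # One line per attribute: name, a tab, then its construction.
        for name, expression in self._constructions.items():
            handle.write(f"{name}\t{expression}\n")

log = ConstructionLog()
ratio = log.register("a / b")
mixed = log.register("log(a) + 1 / c")

# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
log.write(buffer)
```

With a record like this, looking up what a high-scoring generated attribute actually is becomes a simple text search.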
Did I find a new attribute that made an improvement? Yes I did. It was a small improvement but enough to be interesting. It is the sort of improvement that would take you from the middle of the leaderboard to being a contender in a Kaggle competition.