As the recent winner of the RapidMiner farming on Mars competition, I thought I would post the evolutionary process that helped lead me to the best solution.
I've tidied it up a bit, and there are in fact four processes. The main one calls the others using "Execute Process", which keeps common processing in one place and reduces the chance of errors.
The four processes are:
- EvoExample.rmp
- ReadAllData.rmp
- FilterHourAndSelectAttributes.rmp
- ImportData.rmp
When saving these, make sure the names are exactly as above and that all four are saved in the same repository. It's also important to point the processes at the locations of the training and test files. Download the training data from here and the test data from here. Unzip them in the normal way and enter the locations into the "ReadAllData" process by changing the macros on the "Execute Process" operator that runs the "ImportData" process.
When all the dust has settled, run the "EvoExample" process and watch the log output, which gets a new row each time a test is performed with a specific setting of hour and misclassification cost.
These two parameters are buried in the depths of the process; the evolutionary operator chooses values for them and determines how they affect performance.
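For anyone who wants to see the idea outside RapidMiner, here is a minimal Python sketch of an evolutionary search over the same two parameters. It is not what the RapidMiner operator does internally. The function names, the hour range of 0 to 23, the cost range and the toy scoring function are all assumptions made up for illustration. The real score comes from the competition calculation.

```python
import random


def toy_score(hour, cost):
    # Hypothetical stand-in for the competition score; higher is better.
    return -((hour - 12) ** 2) - (cost - 3.0) ** 2


def evolve(score, hours, cost_range, population=6, generations=5, seed=0):
    """Toy evolutionary search over (hour, cost).

    Returns the best setting, its score, and a log with one row per test.
    """
    rng = random.Random(seed)
    log = []  # one (hour, cost, score) row per evaluated setting

    def evaluate(hour, cost):
        value = score(hour, cost)
        log.append((hour, cost, value))
        return value

    # Start from random settings.
    scored = []
    for _ in range(population):
        hour = rng.choice(hours)
        cost = rng.uniform(*cost_range)
        scored.append((evaluate(hour, cost), hour, cost))

    for _ in range(generations):
        # Keep the better half, then add one mutated child per survivor.
        scored.sort(reverse=True)
        survivors = scored[: population // 2]
        children = []
        for _, hour, cost in survivors:
            new_hour = rng.choice(hours) if rng.random() < 0.3 else hour
            new_cost = cost + rng.gauss(0, 0.5)
            new_cost = min(max(new_cost, cost_range[0]), cost_range[1])
            children.append((evaluate(new_hour, new_cost), new_hour, new_cost))
        scored = survivors + children

    best_value, best_hour, best_cost = max(scored)
    return (best_hour, best_cost), best_value, log


if __name__ == "__main__":
    best, value, log = evolve(toy_score, list(range(24)), (1.0, 10.0))
    for row in log:
        print(row)  # one row per test, like the process log
    print("best setting:", best, "score:", value)
```

Because the better half always survives, the best setting seen so far is never lost between generations.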
The process has a couple of interesting features. Firstly, the performance is extracted from a calculation to match the scoring used in the competition. The operator "Extract Performance" is used to do this. Secondly, the process shows a way of varying the value of a macro inside the evolutionary operator. This is a bit of a hack and involves using the varying parameter of the evolutionary operator to control the execution of the example set made by a "Generate Data" operator and then using other operators to extract the details of this dummy example set to place into a macro. Maybe there's an easier way; I couldn't find one.
The contest did reveal an issue with the way nominal values are read and assigned, which shows up when running in Windows and in Linux environments. I hope this will lead to an enhancement that allows more control. In the meantime, the process tries to get round this by sorting and by being explicit in assigning positive and negative labels. Despite this, there is a chance that different environments will yield different results. The interested reader is referred to this process.
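The general idea of sorting and then naming the positive class explicitly can be sketched in a few lines of Python. This is only an illustration of the principle, not how RapidMiner handles nominal mappings. The function name and the "yes"/"no" label values are hypothetical.

```python
def binary_label_mapping(values, positive):
    """Build a label-to-index mapping that does not depend on read order."""
    # Sorting the distinct values removes any dependence on the order
    # in which they happened to be read from the file.
    distinct = sorted(set(values))
    if len(distinct) != 2:
        raise ValueError("expected exactly two distinct label values")
    if positive not in distinct:
        raise ValueError("positive label not found in the data")
    negative = next(v for v in distinct if v != positive)
    # The positive class is named explicitly rather than inferred.
    return {negative: 0, positive: 1}


if __name__ == "__main__":
    # Hypothetical labels, read in two different orders.
    print(binary_label_mapping(["yes", "no", "no"], positive="yes"))
    print(binary_label_mapping(["no", "yes", "yes"], positive="yes"))
```

Both calls give the same mapping, whichever value is seen first.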
Have fun!...