Data Science With RapidMiner: Measuring operator performance for different sizes of data

Sunday, 8 May 2011

Measuring operator performance for different sizes of data

This process measures the performance of a modelling operator as more and more examples are added. The resulting example set of performance values could be used to train a regression model that predicts how a given model will perform on a large data set.
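As a rough illustration of that regression idea outside RapidMiner, the logged values could be exported and fitted with an ordinary least-squares line in Python. This is only a sketch: the function names fit_line and predict are hypothetical, the numbers are made up for illustration rather than taken from this process, and a straight line is an assumption about how the run time grows with the number of rows.

```python
from statistics import mean


def fit_line(sizes, times):
    """Ordinary least-squares fit of times = slope * sizes + intercept."""
    x_bar, y_bar = mean(sizes), mean(times)
    sxx = sum((x - x_bar) ** 2 for x in sizes)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, times))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(slope, intercept, size):
    """Estimate the run time for a data set with `size` rows."""
    return slope * size + intercept


# Illustrative values only (rows, milliseconds); not real measurements.
sizes = [10_000, 20_000, 30_000, 40_000]
times_ms = [520.0, 1010.0, 1490.0, 2030.0]

slope, intercept = fit_line(sizes, times_ms)
estimate = predict(slope, intercept, 1_000_000)
print(f"estimated time for 1,000,000 rows: {estimate:.0f} ms")
```

Extrapolating far beyond the measured sizes is only as good as the assumed shape of the curve, so an algorithm whose cost grows faster than linearly would need a different model.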

The following values are available for logging with the "Log" operator.

cpu-execution-time
cpu-time
execution-time
looptime
time

These correspond to the time taken to execute the process once.

Investigations show that cpu-execution-time and cpu-time are identical; both appear to be measured in microseconds.

The values execution-time, looptime and time are also identical to one another; these appear to be measured in milliseconds.

Furthermore, the microsecond values are almost always 1,000 times the millisecond values, so only one of these values needs to be logged to take the measurement.
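The relationship between the two groups of values can be checked on exported log rows with a few lines of Python. This is a sketch under the assumption that the units really are microseconds and milliseconds as they appear to be; the function names and the tolerance are hypothetical and not part of RapidMiner.

```python
def to_milliseconds(cpu_time_us):
    """Convert a value assumed to be in microseconds to milliseconds."""
    return cpu_time_us / 1000.0


def agrees(cpu_time_us, execution_time_ms, tolerance_ms=1.0):
    """True if the two logged values describe the same duration.

    The tolerance allows for the rounding lost in the coarser
    millisecond value; 1.0 ms is an arbitrary illustrative choice.
    """
    return abs(to_milliseconds(cpu_time_us) - execution_time_ms) <= tolerance_ms


def all_agree(rows, tolerance_ms=1.0):
    """Check every (cpu_time_us, execution_time_ms) pair in `rows`."""
    return all(agrees(us, ms, tolerance_ms) for us, ms in rows)
```

If every logged row passes such a check, logging a single value, for example execution-time, loses nothing.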

The process uses the "Select Subprocess" operator to make it easier to change the model algorithm without having to edit the "Log" operator. Experiments showed that the timings from the "Select Subprocess" operator were, to all intents and purposes, identical to those from the enclosed operator.

Performance is not determined solely by the process; external factors inevitably affect it as other things happen on the computer. The process therefore runs multiple times for each sample size to obtain a more statistically meaningful result.
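The same repeat-and-summarise approach can be sketched in plain Python. This is not the RapidMiner process itself: stand_in_model is a hypothetical workload used in place of the modelling operator, and the sizes and repeat count are arbitrary choices for illustration.

```python
import statistics
import time


def time_once(work, n_rows):
    """Run `work` on `n_rows` rows once and return elapsed milliseconds."""
    start = time.perf_counter()
    work(n_rows)
    return (time.perf_counter() - start) * 1000.0


def measure(work, sizes, repeats=5):
    """Time `work` several times per size and summarise with quartiles."""
    results = {}
    for n_rows in sizes:
        samples = [time_once(work, n_rows) for _ in range(repeats)]
        # quantiles(n=4) returns the three cut points: Q1, median, Q3.
        q1, median, q3 = statistics.quantiles(samples, n=4)
        results[n_rows] = {"q1": q1, "median": median, "q3": q3}
    return results


def stand_in_model(n_rows):
    """Hypothetical workload standing in for the modelling operator."""
    sorted(range(n_rows, 0, -1))


for n_rows, summary in measure(stand_in_model, [10_000, 20_000, 30_000]).items():
    print(n_rows, {name: round(value, 3) for name, value in summary.items()})
```

Reporting the median and quartiles rather than a single run makes an occasional slow iteration, caused by other activity on the machine, stand out instead of distorting the result.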

Here is an example quartile colour plot showing performance of a neural network as the number of rows is increased from 10,000 to 100,000 in increments of 10,000.