Data Science With RapidMiner: What does X-Validation do?

Sunday, 9 January 2011

What does X-Validation do?

The model from X-Validation applied to all the test data will not match the performance estimate given by the same operator.

This is correct behaviour but can be confusing.

Following on from this thread on the RapidMiner forum my explanation of this is as follows.

A 10 fold cross validation on the Iris data set using Decision Trees produces a performance estimate of 93.33% +/- 5.16%. If you use the model produced by this operator on the whole dataset you get a performance of 94.67% (note: in fact, this model is the same as one produced by using the whole dataset).

Which performance number should be believed and why is there a difference?

Imagine someone were to produce a new Iris data set. If we use the complete model to predict the new data, what performance would we get? Would 94.67% or 93.33% be more realistic? There is no right answer but the 93.33% figure is probably more likely to be realistic because this is what the performance estimate is providing. The whole point of the X-Validation operator is to do this; namely estimate what the performance would be on unseen data; outputting a model is not its main aim.

It works by splitting the data into 10 (this is the default) partitions each containing 90% of the original data. It then builds a model on each 90% and applies it to the 10% to get a performance. It does this for all 10 partitions to get an average.

To illustrate with an example, the following 10 performance values are produced inside the operator (I'm using accuracy from the Performance Classification operator and the random number seed for the whole process is 2001).

0.867
0.933
0.933
1.0
0.933
1.0
0.933
1.0
0.867
0.867

The arithmetic mean of these is 0.933 and this matches the performance output by the X-Validation operator; namely 93.33%. The stdevp of these values is 0.051511; this matches the 5.16% error output by the operator. This means the X-Validation operator averages the performance for all the iterations inside the operator (note: if the performance result shows an error estimate then you know that some averaging has been done).

Each of the 10 models may be different. This is inevitable because the data in each partition is different. For example, manual inspection of the models produced inside the operator leads to differences as shown below.

The final model happens to be the same as the first of these but there is no reason to suppose that it would be in general. In practice, the models are likely to be the same in all cases, but it is worth bearing in mind that they might not be.

Nonetheless, the end result is likely to represent the performance of a model built using known data and applied to unknown data. Simply building a model on all data and then applying it to this data may result in a performance that is higher because of overfitting. Using this latter prediction is not as good because no unseen data has been used to make any models.