Wednesday 13 February 2013

Sorting discretized examples in the order nature intended

The "Discretize" operators put examples into different nominal bins depending on the value of an attribute. When using the long form of the name type, the nominal value names follow the general format "rangeN [x-y]".

N starts at 1 and ends at whatever the largest bin number is. N has no leading zeros, so when the names are sorted as text, range10 comes before range2. With the histogram plotter this is OK because the nominal values have an implicit order that gets used. With the advanced plotter, however, the histogram bins come out in the wrong order.
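The sorting problem is easy to reproduce outside RapidMiner. The following is a minimal Python sketch, not RapidMiner code; the bin names are made up for illustration and leave out the "[x-y]" part.

```python
# Hypothetical bin names in the style described above (interval part omitted).
names = [f"range{n}" for n in range(1, 13)]

# Plain string sorting compares character by character,
# so "range10" ends up before "range2".
alphabetical = sorted(names)

# Sorting on the integer after "range" gives the intended order.
numerical = sorted(names, key=lambda s: int(s[len("range"):]))

print(alphabetical)
print(numerical)
```

The first list starts range1, range10, range11, range12 and only then reaches range2, which is the ordering problem described above.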

Here's a histogram, produced using the advanced plotter, showing the original ordering. The data is 10,000 examples generated by multiplying 5 random numbers together and normalizing to the range 0 to 1.

As can be seen, the bins are not in the same order as the underlying numerical values.

This can be fixed by using regular expressions and the "Replace" operator.

I'm not enough of a regular expression ninja to do this in one operator so I had to use two.

So, in the first "Replace" operator, set the "replace what" field to
range(\d+.*)
and set the "replace by" field to
range0000$1
This inserts four zeros in front of the number in every value, so range1 becomes range00001 and range10 becomes range000010.

In the second "Replace" operator, set the "replace what" field to
range0+(\d{4})(.*)
and set the "replace by" field to
range$1$2
This strips the surplus zeros and keeps exactly four digits, so the numeric parts of the range names all have the same length, padded with leading zeros where needed (range0001, range0010 and so on).
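To check what the two replacements do, here is a small Python sketch that applies the same two patterns with re.sub. It is only an illustration: Python writes backreferences as \1 where the "Replace" operator uses $1, the function name is my own, and the interval values in the example names are invented.

```python
import re

def pad_range_name(name: str) -> str:
    # Step 1: put four zeros in front of the bin number.
    step1 = re.sub(r"range(\d+.*)", r"range0000\1", name)
    # Step 2: drop surplus zeros so that exactly four digits remain.
    step2 = re.sub(r"range0+(\d{4})(.*)", r"range\1\2", step1)
    return step2

# Made-up value names in the "rangeN [x - y]" style.
examples = ["range1 [0.0 - 0.1]", "range2 [0.1 - 0.2]", "range10 [0.9 - 1.0]"]
padded = [pad_range_name(n) for n in examples]
print(sorted(padded))
```

After padding, plain text sorting puts range0002 before range0010, which is what we wanted.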

Be aware that you might have to tweak these numbers (the four zeros in the first operator and the {4} in the second) if you have more bins than four digits can hold.
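If you do need a different width, the two pairs of field values can be generated rather than edited by hand. This is a rough Python sketch under my own assumptions: replace_rules and apply_rules are hypothetical helper names, and apply_rules only exists to try the rules out in Python by translating $1 into Python's \g<1> form.

```python
import re

def replace_rules(width: int):
    """Return ("replace what", "replace by") pairs for the two operators."""
    first = (r"range(\d+.*)", "range" + "0" * width + "$1")
    second = (r"range0+(\d{%d})(.*)" % width, "range$1$2")
    return [first, second]

def apply_rules(name: str, width: int) -> str:
    # Python writes backreferences as \g<1> rather than $1, so translate them.
    for pattern, replacement in replace_rules(width):
        py_replacement = re.sub(r"\$(\d)", r"\\g<\1>", replacement)
        name = re.sub(pattern, py_replacement, name)
    return name

for what, by in replace_rules(2):
    print(what, "->", by)
print(apply_rules("range7 [0.6 - 0.7]", 2))
```

The width should be at least the number of digits in the largest bin number, otherwise the biggest bins will again sort out of place.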

The end result is then a histogram like this


Now the ordering is the same as the implied numerical ordering.