Thursday 9 July 2015

Finding quartiles

Here's a process that finds the lower, middle and upper quartiles of a real-valued special attribute within an example set and discretizes the real values into the corresponding bins. It assumes there is only one special attribute. Any additional special attributes would need to be de-selected in an extra step before the example set is processed.

The process works as follows. After sorting the example set, it uses various macro extraction and manipulation operators to work out how many examples there are, determine the indices corresponding to the quartile locations and, from there, read the values of the attribute at those locations. These values are set as macros that are used in the "Discretize by User Specification" operator as the boundaries between the quartile ranges, so that each example is placed into the correct bin.
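For readers who think more easily in code, here is a rough Python sketch of the same idea. It is not the RapidMiner process itself. The function names, the bin labels and the nearest-rank rule used to pick the quartile indices are all assumptions made for illustration, as is treating each boundary as an inclusive upper limit of its bin.

```python
import math
from bisect import bisect_left


def quartile_boundaries(values):
    """Return the three values sitting at the quartile positions.

    Assumption: a nearest-rank rule, index = ceil(n * k / 4) - 1,
    chosen for illustration only.
    """
    ordered = sorted(values)          # step 1: sort the examples
    n = len(ordered)                  # step 2: count the examples
    if n == 0:
        raise ValueError("need at least one value")
    # step 3: work out the index of each quartile location
    indices = [math.ceil(n * k / 4) - 1 for k in (1, 2, 3)]
    # step 4: read the attribute value at each location
    return [ordered[i] for i in indices]


def assign_bins(values, boundaries):
    """Place each value into one of four bins.

    Assumption: each boundary is the inclusive upper limit of its bin,
    so a value equal to a boundary falls into the lower bin.
    """
    labels = ["Q1", "Q2", "Q3", "Q4"]  # hypothetical bin names
    return [labels[bisect_left(boundaries, v)] for v in values]


data = [7, 1, 5, 3, 8, 2, 6, 4]
boundaries = quartile_boundaries(data)
bins = assign_bins(data, boundaries)
```

With the eight values above, the boundaries come out as 2, 4 and 6, and each bin receives two values. Other quartile conventions (interpolating between neighbours, for example) would give slightly different boundaries.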

The main work happens in a subprocess, which makes the process easier to read and allows the operators to be moved to other processes more easily. The very useful "Rename by Generic Names" operator is also used. It lets the macro manipulation operators work without needing to know the name of the special attribute, which again makes the operators more portable when used in other processes.