Search this blog

Showing posts with label Replace. Show all posts
Showing posts with label Replace. Show all posts

Wednesday, 13 February 2013

Sorting discretized examples in the order nature intended

Using the "Discretize" operators puts examples into different nominal bins depending on the value of an attribute. When using the long form of the name type, the possible nominal value names are of this general format "rangeN [x-y]"

N starts at 1 and ends at whatever the largest bin number is. The N is not preceded by any leading zeros so this means that when sorted, range10 comes before range2. When using the histogram plotter this is OK because the nominal values have an implicit order that gets used. When using the advanced plotter however, a histogram comes out wrong.

Here's a histogram, produced using the advanced plotter, showing the original ordering. The data is 10,000 examples generated by multiplying 5 random numbers together and normalizing to the range 0 to 1.

As can be seen, the ordering of the bins is not in the same numerical order of the underlying numerical values.

This can be fixed by using regular expressions and the "Replace" operator.

I'm not enough of a regular expression ninja to do this in one operator so I had to use two.

So, in the first "Replace" operator, set the "replace what" field to
range(\d+.*)
and set the replace by field to
range0000$1
This will change all the values to have leading zeros inserted before the number within the value.

In the second "Replace" operator, set the replace field to
range0+(\d{4})(.*)
and set the replace by field to
range$1$2
This ensures that all the numeric parts of the range name are of the same length and are preceded by at least one leading 0.

Be aware that you might have to tweak these numbers if the number of names is different in your case.

The end result is then a histogram like this


Now the ordering is the same as the implied numerical ordering.

Thursday, 7 July 2011

Using regular expressions with the Replace (Dictionary) operator

The "Replace (Dictionary)" operator replaces occurrences of one nominal in one example set with another looked up from another example set.

By default, this replaces a continuous sequence regardless of its position in the nominal.

For example, if an attribute in the main example set contains the value "network" and the dictionary example set contains the value pair "work", "banana", the result of the operation would be "netbanana".

This is fine but if you want to limit to whole words only then you can use the "use regular expressions" parameter in the replace operator. To make this work, you also have to change the text within the dictionary for the nominal to be replaced with "\b" at the beginning and the end. In regular expression speak, this means match a whole word only.

In addition, if the word to be replaced contains reserved characters (from a regular expressions perspective) then "\Q" and "\E" have to be placed around the word.

One way to do this is to use a "generate attributes" operator and create a new attribute in the dictionary example set using the following expression.

"\\b\\Q"+word+"\\E\\b"
In this case, "word" is the attribute containing the word to be replaced. The "\" must be escaped with an additional "\" in order for it all to come out correctly.

The new attribute would then be used in the "from attribute" parameter of the "Replace (Dictionary)" operator. The "to attribute" would be set to the attribute within the example set dictionary that is the replacement value.