
Wednesday, 30 March 2011

Renaming attributes with regular expressions

If you have an attribute with this name

Firstpart_Secondpart_Thirdpart

and you want to rename it to

Thirdpart_Secondpart_Firstpart

Use the 'Rename By Replacing' operator with this

(.*)_(.*)_(.*)

in the 'replace what' field

and

$3_$2_$1

in the 'replace by' field.

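The brackets denote what are known as capturing groups: whatever each group matches can be used later on through the $1, $2 and $3 entries. The '.*' means match 0 or more of any character. Because each '.*' is separated from the next by a literal '_', a name containing exactly two underscores is split at them: everything from the beginning of the string up to the character before the first '_' is placed into capturing group 1, the text between the two underscores into group 2, and the remainder into group 3. Note that '.*' is greedy, so if the name contains more than two underscores, group 1 takes everything up to the second-to-last '_'.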
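The same substitution can be tried outside RapidMiner. The following is only a sketch in Python, whose re module writes group references in the replacement as \1, \2 and \3 rather than $1, $2 and $3; the function name rename_attribute is made up for illustration.

```python
import re

# Same pattern as in the 'replace what' field.
PATTERN = r"(.*)_(.*)_(.*)"

def rename_attribute(name: str) -> str:
    """Swap the first and third underscore-separated parts of a name.

    Hypothetical helper for illustration only. Python uses \\3_\\2_\\1
    where the 'replace by' field uses $3_$2_$1.
    """
    return re.sub(PATTERN, r"\3_\2_\1", name)

print(rename_attribute("Firstpart_Secondpart_Thirdpart"))
# prints Thirdpart_Secondpart_Firstpart

# With more than two underscores the greedy first group keeps "a_b" together.
print(rename_attribute("a_b_c_d"))
# prints d_c_a_b
```

A name with fewer than two underscores does not match the pattern at all and is returned unchanged.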

Tuesday, 22 March 2011

Counting clusters: part II

Here's an example that uses the k-means clustering algorithm to partition example sets into clusters. It uses the same generated data as here but this time it uses different cluster performance operators to determine how well the clustering works.

Specifically there are examples for
  • Davies-Bouldin
  • Average within centroid distance
  • Cluster density
  • Sum of squares item distribution
  • Gini item distribution
Plotting these measures against different values of k shows something like this.



Interpreting the shape of these graphs is complex and the subject for another day. In this case, the "right" answer is 8 and the measures don't contradict this.

As usual, the answer does not appear by magic. Clustering still requires a human to look at the results, but the performance measures give a helping hand by focusing attention on the important areas.
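To make one of the measures listed above more concrete, here is a minimal Python sketch of the textbook Davies-Bouldin index, where lower values mean tighter, better separated clusters. It is not taken from the RapidMiner process, the value RapidMiner's operator reports may be scaled or signed differently, and the function names and toy data are assumptions made for illustration.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def davies_bouldin(points, labels):
    """Textbook Davies-Bouldin index (assumed definition, not RapidMiner's code).

    Needs at least two clusters whose centroids differ.
    """
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    centroids = {label: centroid(members) for label, members in clusters.items()}

    # Scatter: average distance from each member to its own centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    # For each cluster take its worst (largest) similarity ratio to any other.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Toy data: two obvious groups, ten units apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = davies_bouldin(points, [0, 0, 1, 1])  # sensible clustering
bad = davies_bouldin(points, [0, 1, 0, 1])   # groups mixed together
print(good, bad)
```

On the toy data the sensible clustering scores far lower than the mixed one, which is the kind of difference the plots against k are meant to reveal.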

Wednesday, 16 March 2011

Discretize by user specification: an example

The Discretize By User Specification operator allows numerical attributes to be placed in bins where the boundaries of the bins are defined by the user. This converts numerical attributes into nominal ones as required by some algorithms.

The following shows some example settings for the operator


The class names show the equality tests. The order of the list is important: the first entry must be the biggest and the last the smallest. Anything lower than the smallest entry is automatically less than "-Infinity". The upper case I in Infinity is important.

These example settings on a small example set are shown below.


The attribute "Copy of att1" is simply a copy of att1 before the discretization.
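The idea of user-defined bins can also be sketched in a few lines of Python. This is not the operator's implementation: the class names and limits below are hypothetical rather than the settings from the post, and the sketch assumes each class has an inclusive upper limit and that a value falls into the class with the smallest limit that is still greater than or equal to it.

```python
import bisect
import math

# Hypothetical bins (class name, inclusive upper limit), listed biggest first
# as the post describes. These are NOT the settings used in the post.
CLASSES = [
    ("high", math.inf),
    ("medium", 10.0),
    ("low", 5.0),
]

def discretize(value, classes=CLASSES):
    """Return the name of the class whose upper limit first covers value.

    Assumes inclusive upper limits and a top class with an infinite limit.
    """
    ordered = sorted(classes, key=lambda c: c[1])  # ascending by limit
    limits = [limit for _, limit in ordered]
    # bisect_left finds the first limit that is >= value.
    index = bisect.bisect_left(limits, value)
    return ordered[index][0]

for v in (-3, 5.0, 5.1, 10.0, 42):
    print(v, discretize(v))
```

Keeping an infinite upper limit on the top class means every numerical value lands in some bin.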

Tuesday, 8 March 2011

Generating reports in a RapidMiner process

Here's an example of a RapidMiner process that generates reports. One report is a PDF, the other is a static Web page.

The basic flow is to start with a "Generate Report" operator. This sets up the name, type and location of the report; in this case a PDF file. From then on, subsequent operators refer to the report by name to add things to it. In the example, the next operator creates a "series multiple" plot of the attributes of the iris data set plotted against the label. After that, a page break is added and finally another plot is added, this time of the clusters determined by a k-means clustering algorithm. The final report is located in c:\temp\Iris.pdf.

A single process can create more than one report.

In the example, a "Generate Portal" operator creates another report and similar graphs are added to it. In this case, a static web site is created at c:\temp\RMPortal\irisweb.html. The example also creates a tab that contains text, and it is shamelessly gratifying to note that this can be raw HTML.

Presumably, the enterprise versions have richer graphics and more control. RapidAnalytics would allow reports to be created in accordance with a schedule thereby creating a standalone Web site acting as a reporting portal.