Search this blog

Sunday, 12 June 2011

Counting clusters: part R

Here's an example process that uses an R script called from RapidMiner to perform clustering and provide a silhouette validity index.

As before, it uses the same artificial data as for the previous examples; namely 1000 data points in three dimensions with 8 clusters.

The script uses the R package "cluster". This contains the algorithm "partitioning around medoids" which the documentation describes as a more robust k-means.

The process iterates over values of k from 2 to 20, passes the data to R for clustering and generation of an average silhouette value. This allows the optimum value for k to be determined. The "correct" answer is 8 but this may not correspond to the best cluster using the validity measure owing to the random number generator which causes the clusters to differ between each run.

Some points to note
  • The R script installs the package and this causes a pop up dialog box to appear. Select the mirror from which to download the package. Comment out the "install.packages" line to stop this (or read the R documentation to work out how to test for the presence of the library before attempting the install).
  • The R script takes multiple inputs and these appear as R data frames.
  • The output from the R script must be a data frame.
  • The returned example set contains the validity measure and this is picked up for logging using the "Extract Performance" operator.
  • The order of the processes before the script is important to ensure the correct value of k is passed in. This can be done using the "show and alter operator execution order" feature on the process view.
  • The operator "Generate Data By User Specification" is used to create an example set to contain the value of k
The results should look something like this.


A value near 1 is what is being looked for and indicates compact, well separated clusters. The "correct" answer is 8 and the result supports this.

Saturday, 4 June 2011

Counting clusters: part IV

Here's an example process that uses the operator "Map Clustering on Labels" to match labels to clusters. If the data already has labels, the operator allows the output from a clustering algorithm to be assessed. It compares labels and clusters and determines the best match. From this, the output is a prediction and this can be passed to a performance operator to allow a confusion matrix to be created.

This allows a clustering algorithm to be used for prediction if training examples are available.