As before, it uses the same artificial data as the previous examples, namely 1000 data points in three dimensions with 8 clusters.
The script uses the R package "cluster". This contains the algorithm "partitioning around medoids", which the documentation describes as a more robust version of k-means.
The process iterates over values of k from 2 to 20, passing the data to R each time for clustering and for calculation of an average silhouette value. Comparing these values across k allows the optimum value of k to be determined. The "correct" answer is 8, but this may not correspond to the best clustering according to the validity measure, because the random number generator causes the clusters to differ from run to run.
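The loop described above can be sketched in plain R, outside RapidMiner. This is only an illustration under stated assumptions: the data is a stand-in generated inside the script (not the data set used in the process), and the names `dataPoints`, `results` and `bestK` are made up for the example. The calls to `pam()` and its `silinfo$avg.width` component come from the "cluster" package.

```r
library(cluster)  # provides pam()

set.seed(42)  # fixed seed so the stand-in data is reproducible

# Stand-in data (an assumption, not the post's data set):
# 8 Gaussian blobs in three dimensions, 50 points each.
centres <- matrix(runif(8 * 3, min = -10, max = 10), ncol = 3)
blob <- rep(1:8, each = 50)
noise <- matrix(rnorm(length(blob) * 3, sd = 0.5), ncol = 3)
dataPoints <- as.data.frame(centres[blob, ] + noise)

# Cluster once for each k and record the average silhouette width.
ks <- 2:20
avgSilhouette <- sapply(ks, function(k) pam(dataPoints, k)$silinfo$avg.width)

# One row per k; the k with the largest average silhouette is the candidate optimum.
results <- data.frame(k = ks, avgSilhouette = avgSilhouette)
bestK <- results$k[which.max(results$avgSilhouette)]

print(results)
print(bestK)
```

In the RapidMiner process the loop itself lives outside R and the script is called once per value of k, so only the body of the `sapply` call corresponds to what the script does.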
Some points to note:
- The R script installs the package, and this causes a pop-up dialog box to appear asking for the mirror from which to download the package. Select one to continue. To stop the dialog appearing, comment out the "install.packages" line (or read the R documentation to work out how to test for the presence of the library before attempting the install).
- The R script takes multiple inputs and these appear as R data frames.
- The output from the R script must be a data frame.
- The returned example set contains the validity measure and this is picked up for logging using the "Extract Performance" operator.
- The order of the processes before the script is important to ensure the correct value of k is passed in. This can be done using the "show and alter operator execution order" feature on the process view.
- The operator "Generate Data By User Specification" is used to create an example set to contain the value of k.
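Two of the points above, the package check and the data frame output, might look something like the following in R. It is a sketch rather than the script used in the process: the helper `clusterValidity`, the column name `k` and the choice of mirror URL are all assumptions made for the example, and how RapidMiner hands the inputs to the script is not shown.

```r
# Install "cluster" only when it is missing; naming a mirror avoids the dialog box.
# The mirror URL is an assumption; any CRAN mirror would do.
if (!requireNamespace("cluster", quietly = TRUE)) {
  install.packages("cluster", repos = "https://cloud.r-project.org")
}
library(cluster)

# Hypothetical helper: both inputs arrive as data frames and the result
# is returned as a data frame holding k and the validity measure.
clusterValidity <- function(data, kFrame) {
  k <- as.integer(kFrame$k[1])  # assumes the value of k is in a column named "k"
  fit <- pam(data, k)
  data.frame(k = k, avgSilhouette = fit$silinfo$avg.width)
}

# Tiny usage example with two obvious groups of two points each.
toy <- data.frame(x = c(0, 0.1, 5, 5.1), y = c(0, 0.2, 5, 5.2))
out <- clusterValidity(toy, data.frame(k = 2))
print(out)
```

Returning a one-row data frame keeps the validity measure in a form that can be picked up as an example set on the RapidMiner side.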
An average silhouette value near 1 is what is being looked for, since it indicates compact, well-separated clusters. The "correct" answer is 8 and the result supports this.
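For reference, the usual definition of the silhouette value helps explain why values near 1 are good. The symbols $a(i)$, $b(i)$ and $s(i)$ below are the conventional ones and are not taken from the process itself. For a point $i$, $a(i)$ is the mean distance to the other points in its own cluster and $b(i)$ is the smallest mean distance to the points of any other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad -1 \le s(i) \le 1,
\qquad \bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i)
```

When a point sits much closer to its own cluster than to the nearest other cluster, $a(i)$ is small compared with $b(i)$ and $s(i)$ approaches 1. The average $\bar{s}$ over all $n$ points is the value logged for each k.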