Search this blog

Sunday, 10 April 2011

Counting clusters: part III

This process uses agglomerative clustering to create clusters.

As before, the same artificial data as with the k-means and DBScan approaches is used with some of the same cluster validity measures.

The agglomerative clustering operator can create its clusters in three modes. These are single link, average link and complete link. The process iterates over these options in addition to the value of k, the number of clusters.

To extract a cluster from an agglomerative model, the operator "Flatten clustering" must be used.

To make the process more efficient, it would be possible to run the clustering outside the loop operator but that is an exercise for another day. Using the date_now() function within a "Generate attributes" operator would be one way to allow this to be timestamped for those interested in the improvement.

This process also uses the tricky "Pivot" operator. This allows the validity measures for the different modes to be presented as attributes in their own right. To perform this operation, it is necessary to use the "Log to data" operator that turns everything that has been logged into an example set.

Here is an example graph to show how the validity measures vary as k varies. As before, the "correct" answer is 8.

No comments:

Post a Comment