
Sunday 12 June 2011

Counting clusters: part R

Here's an example process that uses an R script, called from RapidMiner, to perform clustering and compute a silhouette validity index.

As before, it uses the same artificial data as the previous examples: 1000 data points in three dimensions containing 8 clusters.

The script uses the R package "cluster". This contains the "partitioning around medoids" algorithm, which the documentation describes as a more robust version of k-means.

The process iterates over values of k from 2 to 20, passing the data to R for clustering and for the calculation of an average silhouette value for each k. This allows the optimum value of k to be determined. The "correct" answer is 8, but the validity measure may not always pick this out because the random number generation causes the clusters to differ from run to run.
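For illustration, here is a minimal sketch of the sort of script the R operator runs. It is not the exact script from the process: the input names "data" and "kFrame" are assumptions standing in for whatever names the operator's inputs are given.

    library(cluster)

    ## assumed input names: "data" is the example set to cluster and
    ## "kFrame" is the one-row example set carrying the value of k
    kClusters = kFrame$k[1]

    ## partitioning around medoids from the "cluster" package
    clusteredData = pam(data, kClusters, metric = "euclidean", stand = FALSE)

    ## the average silhouette width is the validity measure; the value
    ## returned to RapidMiner must be wrapped in a data frame
    averageSilhouette = as.data.frame(clusteredData$silinfo$avg.width)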

Some points to note
  • The R script installs the "cluster" package and this causes a pop-up dialog box to appear asking which mirror to download the package from. Comment out the "install.packages" line to stop this, or test for the presence of the package before attempting the install (see the sketch after this list).
  • The R script takes multiple inputs and these appear in R as data frames.
  • The output from the R script must be a data frame.
  • The returned example set contains the validity measure and this is picked up for logging using the "Extract Performance" operator.
  • The order of the operators before the script is important to ensure the correct value of k is passed in. This can be checked using the "show and alter operator execution order" feature in the process view.
  • The operator "Generate Data By User Specification" is used to create an example set containing the value of k.
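A minimal sketch of such a guard follows; the repository URL is just an example, and setting it is also what suppresses the mirror dialog.

    ## install the package only if it is not already available
    if (!require(cluster)) {
      install.packages("cluster", repos = "http://cran.r-project.org")
      library(cluster)
    }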
The results should look something like this.

[Plot: average silhouette value for each value of k from 2 to 20]
A value near 1 is what is being looked for, since it indicates compact, well-separated clusters. The "correct" answer is 8 and the result supports this.

17 comments:

  1. Hi Andrew,
    Thank you very much!! It is a very valuable blog post indeed. I was able to map your example onto a text clustering project. I am representing people by their text (what they write in comments), so every user is a term-weight vector where each element is a TF-IDF term weight; in this case my attributes are also numerical, exactly as in your example. However, the silhouette validity approach didn't give a clear "correct" number of clusters as it does in this hypothetical example. I only got a linear correlation between k and the silhouette index: the larger k, the better (closer to 1.00) the silhouette. The number of attributes in my case is 674 and maybe that's the reason, but real-life data mining always involves that sort of number of attributes! What do you think I can do to determine the correct number of clusters? Can dimension reduction work? Have you ever done text clustering and applied cluster validity measures to determine the right k?
    Many thanks again,
    Great Blog!
    Ahmad

  2. Hello Ahmad,

    Thanks for the thanks!

    I haven't personally clustered text documents, but large numbers of attributes do present a problem. Principal component analysis is one approach to reducing the number of attributes. You could also look at correlations and eliminate attributes that are closely correlated, although this is not guaranteed to help.
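    As a rough sketch of the PCA route (the name "tfidf" and the number of components kept are illustrative assumptions, not tested values):

    ## hypothetical example: project a wide term-weight data frame
    ## onto its first few principal components before clustering
    pcs = prcomp(tfidf)
    reduced = as.data.frame(pcs$x[, 1:10])  ## keep 10 components (arbitrary)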

    With a large number of attributes it can be difficult to work out which attribute drives membership of which cluster.

    As it happens, this formed the subject of a paper I presented at the recent RapidMiner conference.

    http://www.amazon.de/Proceedings-RapidMiner-Community-Meeting-Conference/dp/3844000933/

    One thing I tried in the paper was to recast the cluster as a label and use a decision tree algorithm on it, to see whether the resulting tree gives any insight into which attributes drive cluster membership.
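    In R terms the idea looks something like this; the names "df" and "clustering" are illustrative rather than the code from the paper.

    library(rpart)

    ## recast the cluster assignment as a label ("df" is assumed to
    ## hold the original attributes, "clustering" the assignments)...
    df$cluster = as.factor(clustering)
    ## ...and fit a decision tree to it; the splits suggest which
    ## attributes drive cluster membership
    tree = rpart(cluster ~ ., data = df)
    print(tree)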

    regards

    Andrew

  3. Hi, I really appreciate your posts. I have one question about R integration. You show how to take a dataset from RapidMiner, cluster it and evaluate it in R. My problem is that I want to do the clustering in RapidMiner and then import the generated cluster set into R to calculate some statistics that are not included in RapidMiner, like the Rand index, silhouette, etc. Do you know how to import the cluster set from a RapidMiner operator into R? Thanks very much.

  4. @N@v

    I've done something similar; I'll search around for the example and post it.

    Andrew

  5. I found this fragment of R code from a larger process that I can't publish for various reasons...

    Hopefully this will give you a start.


    library(mclust)
    library(profdpm)

    ## "data" is the name given to the input in the operator's inputs;
    ## because the input is a data frame, each column is referred to
    ## by name using $
    ARI = adjustedRandIndex(data$cluster1, data$cluster2)
    ARI = as.data.frame(ARI)
    ## named pciValues rather than pci to avoid shadowing the
    ## profdpm function of the same name
    pciValues = as.data.frame(t(pci(data$cluster1, data$cluster2)))

    ## using the variable name ARI sets the name of the returned data
    ## frame column and avoids having to use a rename operator
    ## "x" is the output defined in the results; it must be a data frame
    x = as.data.frame(cbind(ARI, pciValues))

  6. Hi Andrew,
    I would like to ask where you got this dataset?
    I would like to use it for my experiments too.

    1. Hello

      I edited the post above to point to one of the previous examples where the artificial data is available.

      Andrew

  7. Thank you. It's a very nice RM-Workflow. You've helped me a lot.

  8. Do you have any idea why the highest silhouette value is reached for k = 2 on the Iris test set? The process delivers the wrong answer; it's well known that it is three.

    1. I tried the PAM algorithm on the Iris data in R directly and I got

      k   cluster$silinfo$avg.width
      2   0.69
      3   0.55
      4   0.49
      5   0.49
      6   0.47

      So I can't explain why 3 is not shown as the maximum when it "should" be. I suppose it underlines the importance of being careful with cluster validity measures: they are not guaranteed to give the right answer, merely to give guidance about which clusterings appear better.
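      For reference, this check can be reproduced directly in R with the built-in iris data, something like:

      library(cluster)

      ## print the average silhouette width for each value of k
      for (k in 2:6) {
        fit = pam(iris[, 1:4], k)
        cat(k, round(fit$silinfo$avg.width, 2), "\n")
      }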

  9. Hello,
    I tried to adapt this workflow to my data (65,000 rows, 140 cols).
    I'm getting an error "The input example set has less than 0 examples. .... Offending operator: averageSilhouette"
    The things I changed in the workflow are:
    1. Retrieve data (from db)
    2. Set Role ID to ID column
    3. Replace missing values
    4. Remove useless attribute (1 attr)
    5. Then continue with your workflow (Loop)

    Thanks,

  10. Hello

    It's difficult to be certain, but it looks like the attributes are being filtered out by the "Select Attributes" operator.

    Andrew

  11. Hi Andrew,
    Thanks for the response.
    I checked the workflow again and found that my mistake was in the R script operator: I forgot to change "... dataNames[1:140] ...". I successfully ran the workflow on different numbers of examples: 1000, 2000, ... (even 10000). But when I removed the limit to try the whole dataset, I got the same error as before.

  12. Hello Berat

    I found the problem: it's a memory issue in R.

    The RapidMiner log contained this...

    Execute Script (3): 3: In double(1 + (n * (n - 1))/2) :
    Reached total allocation of 2047Mb: see help(memory.size)

    ...and sure enough, if you repeat the "pam" command in R manually, you get an error message when the input data gets too big. Fixing it would require the R function memory.limit(size=nnnn), with the value being determined in your environment.
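    For example (memory.limit is Windows-only, and the size here is a hypothetical value, not a recommendation):

    ## raise R's memory allocation ceiling; size is in megabytes
    memory.limit(size = 4095)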


    regards

    Andrew

  13. Hi Andrew,
    Thanks for your response.
    I tried playing with different memory limit settings but was unsuccessful (Intel Core i7, 8 CPUs x 1.6 GHz, 6 GB RAM).

    I ended up modifying the R script to use CLARA instead of PAM, just as an experiment.

    ....
    clusteredData = clara(data[dataNames[1:length(dataNames)]], kClusters, metric = "euclidean", stand = FALSE, samples = 1, sampsize = 10000, trace = 0, medoids.x = FALSE, keep.data = FALSE, rngR = FALSE)
    ....

    The process continued smoothly and finished after around an hour and a half.
    I don't know exactly how clara works (I hope I didn't misuse it) but I found that many people recommend it for large datasets.

    In any case, many thanks for your posts and feedback.
    Regards,
