
Sunday 12 June 2011

Counting clusters: part R

Here's an example process that uses an R script, called from RapidMiner, to perform clustering and compute a silhouette validity index.

As before, it uses the same artificial data as the previous examples: 1000 data points in three dimensions containing 8 clusters.

The script uses the R package "cluster". This contains the "partitioning around medoids" algorithm, which the documentation describes as a more robust version of k-means.

The process iterates over values of k from 2 to 20, passing the data to R for clustering and for the calculation of an average silhouette value for each k. This allows the optimum value of k to be determined. The "correct" answer is 8, but the validity measure may not always pick this out because the random number generation causes the clusters to differ from run to run.
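For illustration, here is a minimal sketch of the sort of script the R operator runs. It is not the exact script from the process: the input names "data" and "kFrame" are assumptions standing in for whatever names the operator's inputs are given.

    library(cluster)

    ## assumed input names: "data" is the example set to cluster and
    ## "kFrame" is the one-row example set carrying the value of k
    kClusters = kFrame$k[1]

    ## partitioning around medoids from the "cluster" package
    clusteredData = pam(data, kClusters, metric = "euclidean", stand = FALSE)

    ## the average silhouette width is the validity measure; the value
    ## returned to RapidMiner must be wrapped in a data frame
    averageSilhouette = as.data.frame(clusteredData$silinfo$avg.width)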

Some points to note
  • The R script installs the "cluster" package and this causes a pop-up dialog box to appear asking which mirror to download the package from. Comment out the "install.packages" line to stop this, or test for the presence of the package before attempting the install (see the sketch after this list).
  • The R script takes multiple inputs and these appear in R as data frames.
  • The output from the R script must be a data frame.
  • The returned example set contains the validity measure and this is picked up for logging using the "Extract Performance" operator.
  • The order of the operators before the script is important to ensure the correct value of k is passed in. This can be checked using the "show and alter operator execution order" feature in the process view.
  • The operator "Generate Data By User Specification" is used to create an example set containing the value of k.
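A minimal sketch of such a guard follows; the repository URL is just an example, and setting it is also what suppresses the mirror dialog.

    ## install the package only if it is not already available
    if (!require(cluster)) {
      install.packages("cluster", repos = "http://cran.r-project.org")
      library(cluster)
    }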
The results should look something like this.

[Plot: average silhouette value for each value of k from 2 to 20]
A value near 1 is what is being looked for, since it indicates compact, well-separated clusters. The "correct" answer is 8 and the result supports this.

17 comments:

  1. Hi Andrew,
    Thank you very much!! It is a very valuable blog post indeed. I was able to map your example onto a text clustering project. I am representing people by their text (what they write in comments), so every user is a term-weight vector where each element is a TF-IDF term weight; in this case my attributes are also numerical, exactly as in your example. However, the silhouette validity approach didn't give a clear "correct" number of clusters as it does in this hypothetical example. I only got a linear correlation between k and the silhouette index: the larger k, the better (closer to 1.00) the silhouette. The number of attributes in my case is 674 and maybe that's the reason, but real-life data mining always involves that sort of number of attributes! What do you think I can do to determine the correct number of clusters? Can dimension reduction work? Have you ever done text clustering and applied cluster validity measures to determine the right k?
    Many thanks again,
    Great Blog!
    Ahmad

  2. Hello Ahmad,

    Thanks for the thanks!

    I haven't personally clustered text documents, but large numbers of attributes do present a problem. Principal component analysis is one approach to reducing the number of attributes. You could also look at correlations and eliminate attributes that are closely correlated, although this is not guaranteed to help.
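    As a rough sketch of the PCA route (the name "tfidf" and the number of components kept are illustrative assumptions, not tested values):

    ## hypothetical example: project a wide term-weight data frame
    ## onto its first few principal components before clustering
    pcs = prcomp(tfidf)
    reduced = as.data.frame(pcs$x[, 1:10])  ## keep 10 components (arbitrary)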

    With a large number of attributes it can be difficult to work out which attribute drives membership of which cluster.

    As it happens, this formed the subject of a paper I presented at the recent RapidMiner conference.

    http://www.amazon.de/Proceedings-RapidMiner-Community-Meeting-Conference/dp/3844000933/

    One thing I tried in the paper was to recast the cluster as a label and use a decision tree algorithm on it, to see whether the resulting tree gives any insight into which attributes drive cluster membership.
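    In R terms the idea looks something like this; the names "df" and "clustering" are illustrative rather than the code from the paper.

    library(rpart)

    ## recast the cluster assignment as a label ("df" is assumed to
    ## hold the original attributes, "clustering" the assignments)...
    df$cluster = as.factor(clustering)
    ## ...and fit a decision tree to it; the splits suggest which
    ## attributes drive cluster membership
    tree = rpart(cluster ~ ., data = df)
    print(tree)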

    regards

    Andrew

  3. Hi, I really appreciate your posts. I have one question about R integration. You show how to take a dataset from RapidMiner, cluster it and evaluate it in R. My problem is that I want to do the clustering in RapidMiner and then import the generated cluster set into R to calculate some statistics that are not included in RapidMiner, like the Rand index, silhouette, etc. Do you know how to import the cluster set from a RapidMiner operator into R? Thanks very much.

  4. @N@v

    I've done something similar; I'll search around for the example and post it.

    Andrew

  5. I found this fragment of R code from a larger process that I can't publish for various reasons...

    Hopefully this will give you a start.


    library(mclust)
    library(profdpm)

    ## "data" is the name given to the input in the operator's inputs;
    ## because the input is a data frame, each column is referred to
    ## by name using $
    ARI = adjustedRandIndex(data$cluster1, data$cluster2)
    ARI = as.data.frame(ARI)
    ## named pciValues rather than pci to avoid shadowing the
    ## profdpm function of the same name
    pciValues = as.data.frame(t(pci(data$cluster1, data$cluster2)))

    ## using the variable name ARI sets the name of the returned data
    ## frame column and avoids having to use a rename operator
    ## "x" is the output defined in the results; it must be a data frame
    x = as.data.frame(cbind(ARI, pciValues))

  6. Hi Andrew,
    I would like to ask where you got this dataset?
    I would like to use it for my experiments too.

    1. Hello

      I edited the post above to point to one of the previous examples where the artificial data is available.

      Andrew

  7. Thank you. It's a very nice RM-Workflow. You've helped me a lot.

  8. Do you have any idea why the highest silhouette value is reached for k = 2 on the Iris test set? The process delivers the wrong answer; it's well known that it is three.

    1. I tried the PAM algorithm on the Iris data in R directly and I got

      k   cluster$silinfo$avg.width
      2   0.69
      3   0.55
      4   0.49
      5   0.49
      6   0.47

      So I can't explain why 3 is not shown as the maximum when it "should" be. I suppose it underlines the importance of being careful with cluster validity measures: they are not guaranteed to give the right answer, merely to give guidance about which clusterings appear better.
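      For reference, this check can be reproduced directly in R with the built-in iris data, something like:

      library(cluster)

      ## print the average silhouette width for each value of k
      for (k in 2:6) {
        fit = pam(iris[, 1:4], k)
        cat(k, round(fit$silinfo$avg.width, 2), "\n")
      }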

  9. Hello,
    I tried to adapt this workflow to my data (65,000 rows, 140 cols).
    I'm getting an error "The input example set has less than 0 examples. .... Offending operator: averageSilhouette"
    The things I changed in the workflow are:
    1. Retrieve data (from db)
    2. Set Role ID to ID column
    3. Replace missing values
    4. Remove useless attribute (1 attr)
    5. Then continue with your workflow (Loop)

    Thanks,

  10. Hello

    It's difficult to be certain, but it looks like the attributes are being filtered out by the "Select Attributes" operator.

    Andrew

  11. Hi Andrew,
    Thanks for the response.
    I checked the workflow again and found that my mistake was in the R script operator: I forgot to change "... dataNames[1:140] ...". I successfully ran the workflow on different numbers of examples: 1000, 2000, ... (even 10000). But when I removed the limit to try the whole dataset, I got the same error as before.

  12. Hello Berat

    I found the problem: it's a memory issue in R.

    The RapidMiner log contained this...

    Execute Script (3): 3: In double(1 + (n * (n - 1))/2) :
    Reached total allocation of 2047Mb: see help(memory.size)

    ...and sure enough, if you repeat the "pam" command in R manually, you get an error message when the input data gets too big. Fixing it would require the R function memory.limit(size=nnnn), with the value being determined in your environment.
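    For example (memory.limit is Windows-only, and the size here is a hypothetical value, not a recommendation):

    ## raise R's memory allocation ceiling; size is in megabytes
    memory.limit(size = 4095)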


    regards

    Andrew

  13. Hi Andrew,
    Thanks for your response.
    I tried playing with different memory limit settings but was unsuccessful (Intel Core i7, 8 CPUs x 1.6 GHz, 6 GB RAM).

    I ended up modifying the R script to use CLARA instead of PAM, just as an experiment.

    ....
    clusteredData = clara(data[dataNames[1:length(dataNames)]], kClusters, metric = "euclidean", stand = FALSE, samples = 1, sampsize = 10000, trace = 0, medoids.x = FALSE, keep.data = FALSE, rngR = FALSE)
    ....

    The process continued smoothly and finished after around an hour and a half.
    I don't know exactly how clara works (I hope I didn't misuse it) but I found that many people recommend it for large datasets.

    In any case, many thanks for your posts and feedback.
    Regards,
