Search this blog

Monday, 3 January 2011

K-distances plots

The operator "Data to Similarity" calculates how similar each example is to each other example within an example set. Here's an example using 5 examples with euclidean distance as the measure.

Firstly, the examples...

Now the distances...

A k-distance plot displays, for a given value of k, what the distances are from all points to the kth nearest. These are sorted and plotted.

For k = 2, which is equivalent to the nearest neighbour, the nearest distances for each id are
  1. 0.014
  2. 0.014
  3. 0.177
  4. 0.378
  5. 0.400
(edit: I made a mistake previously, items 3, 4 and 5 were wrong)

The plot looks like this

The smallest value is to the right rather than starting at the left near the origin.

These plots can be used to determine choices for the epsilon parameter in the DBScan clustering operator.

Some more notes about this to follow...


  1. What does k mean here, does it mean among the distance of an id, pick the kth nearest distance? Or anything else.
    If its so are the figure right?
    Shouldn't the kth distance of all points should be like:
    1 0.178
    2 0.177
    3 0.178
    4 0.396
    5 0.422

    I am sorry I know I have missed something. Could you please clarify it.

  2. I made a mistake in the original post which I've corrected now - sorry about that.

    The distance to the kth nearest neighbour when k = 2 is the distances between these pairs
    1 and 2 ... 0.014
    2 and 1 ... 0.014
    3 and 2 ... 0.177
    4 and 3 ... 0.378
    5 and 4 ... 0.400

  3. Oh got it thanks,so kdist k=1, means the distance with itself.

    Is there a way to plot this and get the value of epsilon in R? That is my concern.
    I will really appreciate if you could help.

  4. There is never a straightforward answer but there is something in the following

    Cluster analysis: Basic concepts and algorithms, chapter 8, Tan, P.N. and Steinbach, M. and Kumar, V. 2006

    The crucial sentence from this is

    "...if we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps."