Data Science With RapidMiner: K-distances plots

Monday, 3 January 2011

K-distances plots

The operator "Data to Similarity" calculates how similar each example is to each other example within an example set. Here's an example using 5 examples with euclidean distance as the measure.

Firstly, the examples...

Now the distances...

A k-distance plot displays, for a given value of k, what the distances are from all points to the kth nearest. These are sorted and plotted.

For k = 2, which is equivalent to the nearest neighbour, the nearest distances for each id are

0.014
0.014
0.177
0.378
0.400

(edit: I made a mistake previously, items 3, 4 and 5 were wrong)

The plot looks like this

The smallest value is to the right rather than starting at the left near the origin.

These plots can be used to determine choices for the epsilon parameter in the DBScan clustering operator.

Some more notes about this to follow...

4 comments:

Anonymous25 February 2013 at 11:43
What does k mean here, does it mean among the distance of an id, pick the kth nearest distance? Or anything else.
If its so are the figure right?
Shouldn't the kth distance of all points should be like:
1 0.178
2 0.177
3 0.178
4 0.396
5 0.422

I am sorry I know I have missed something. Could you please clarify it.
ReplyDelete
Replies
Andrew25 February 2013 at 12:22
I made a mistake in the original post which I've corrected now - sorry about that.

The distance to the kth nearest neighbour when k = 2 is the distances between these pairs
1 and 2 ... 0.014
2 and 1 ... 0.014
3 and 2 ... 0.177
4 and 3 ... 0.378
5 and 4 ... 0.400

ReplyDelete
Replies
Anonymous25 February 2013 at 13:52
Oh got it thanks,so kdist k=1, means the distance with itself.

Is there a way to plot this and get the value of epsilon in R? That is my concern.
I will really appreciate if you could help.
ReplyDelete
Replies
Andrew26 February 2013 at 16:45
There is never a straightforward answer but there is something in the following

Cluster analysis: Basic concepts and algorithms, chapter 8, Tan, P.N. and Steinbach, M. and Kumar, V. 2006

The crucial sentence from this is

"...if we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps."
ReplyDelete
Replies

Add comment

Data Science With RapidMiner

Search this blog

Monday, 3 January 2011

K-distances plots

4 comments:

About Me

Labels

Blog Archive