Firstly, the examples...
Now the distances...
A k-distance plot shows, for a given value of k, the distance from each point to its kth nearest neighbour. These distances are then sorted and plotted.
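As a rough illustration of that definition, here is a minimal Python sketch. The function name `k_distances` and the sample points are hypothetical (they are not the data used in this post), and the sketch assumes the convention that a point counts as its own first neighbour, so k = 2 gives the nearest other point.

```python
import math


def k_distances(points, k):
    """Return, for each point, the distance to its k-th nearest point.

    Assumed convention: the point itself counts as k = 1 (distance 0),
    so k = 2 is the nearest *other* point.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    result = []
    for p in points:
        # Distances from p to every point, including p itself (0.0).
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[k - 1])
    return result


# Hypothetical 2-D points, chosen only to keep the arithmetic easy.
points = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0), (4.0, 3.0)]
kd = k_distances(points, 2)

# Sorted largest first, ready for plotting.
print(sorted(kd, reverse=True))  # [4.0, 2.0, 1.0, 1.0]
```

This brute-force version compares every pair of points, which is fine for a handful of examples but would be slow on a large data set.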
For k = 2, which here corresponds to the nearest neighbour (k = 1 being the point itself), the nearest-neighbour distances for each id are
- 0.014
- 0.014
- 0.177
- 0.378
- 0.400
The plot looks like this
The smallest value is to the right rather than starting at the left near the origin.
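To make that ordering concrete, the sketch below sorts the five distances listed above from largest to smallest and draws them as a crude text chart. The helper `text_plot` is a hypothetical stand-in for a real plotting tool, and the descending order is an assumption based on the smallest value appearing at the right of the plot.

```python
# Nearest-neighbour distances listed in the post (k = 2).
nearest = [0.014, 0.014, 0.177, 0.378, 0.400]

# Largest first, so the smallest value ends up last (rightmost on a plot).
ordered = sorted(nearest, reverse=True)


def text_plot(values, width=40):
    """Return one text line per value, with a bar scaled to the largest value."""
    top = max(values)
    lines = []
    for rank, v in enumerate(values, start=1):
        bar_len = round(v / top * width) if top > 0 else 0
        lines.append(f"{rank:>2} {v:.3f} {'#' * bar_len}")
    return lines


for line in text_plot(ordered):
    print(line)
```

Each row is one point's rank in the sorted order, so reading the rows from top to bottom corresponds to reading the plot from left to right.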
These plots can help in choosing a value for the epsilon parameter of the DBScan clustering operator.
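One crude way to read a candidate off the sorted values is to look for the largest jump between neighbouring k-distances. The sketch below does that for the five distances listed above. The function `largest_jump` is a hypothetical helper, not part of any clustering tool, and with so few points it is only an illustration; on real data the sharp change is usually judged by eye from the plot.

```python
def largest_jump(k_dists):
    """Return (lower, upper): the adjacent pair of sorted values with the biggest gap."""
    ordered = sorted(k_dists)
    if len(ordered) < 2:
        raise ValueError("need at least two k-distances")
    # (gap, lower, upper) for each pair of neighbours in the sorted list.
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    _, lower, upper = max(gaps)
    return lower, upper


# Nearest-neighbour distances listed in the post (k = 2).
k_dists = [0.014, 0.014, 0.177, 0.378, 0.400]
lower, upper = largest_jump(k_dists)
print(lower, upper)  # 0.177 0.378
```

The sharpest change here falls between 0.177 and 0.378, which marks the region of the plot to inspect when picking epsilon.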
Some more notes about this to follow...
What does k mean here? Does it mean that, among the distances for an id, you pick the kth nearest distance, or something else?
If so, are the figures right?
Shouldn't the kth distances of all the points be:
1 0.178
2 0.177
3 0.178
4 0.396
5 0.422
I am sorry, I know I have missed something. Could you please clarify?
I made a mistake in the original post, which I've now corrected. Sorry about that.
The distance to the kth nearest neighbour when k = 2 is the distance between each of these pairs:
1 and 2 ... 0.014
2 and 1 ... 0.014
3 and 2 ... 0.177
4 and 3 ... 0.378
5 and 4 ... 0.400
Oh, got it, thanks. So k-dist with k = 1 means the distance from a point to itself.
Is there a way to plot this and get the value of epsilon in R? That is my concern.
I would really appreciate it if you could help.
There is never a straightforward answer, but there is something in the following:
Tan, P.N., Steinbach, M. and Kumar, V. (2006), "Cluster Analysis: Basic Concepts and Algorithms", chapter 8 of Introduction to Data Mining.
The crucial sentence from this is:
"...if we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps."