Search this blog

Sunday, 3 April 2011

How average within cluster distance is calculated for numeric attributes

The performance for each cluster is calculated first. The cluster density performance operator calculates the average distance between points in a cluster and multiplies this by the number of points minus 1. Euclidean distance is used as the distance measure.

So a cluster containing 1 point would have an average distance of 0 because there no other points.

A cluster containing 2 points would have a density equal to the distance between the points.

Here's a worked example for 3 points. Firstly, here are the points

 This table shows the Euclidean distance between them

For example SqrRoot ((-8.625-2.468)^2 + (-5.590-7.267)^2) = 16.981

The average distance of these is 13.726 and the average within distance performance if these three points were in a cluster would be (3-1) times that or 27.452. I'm not quite sure why this multiplication happens (and I have done experiments using 4 points to confirm it and I've looked at the code). I'll leave this comment as a place holder for another day.

Here is what RapidMiner shows.

The negative value is imposed by RapidMiner and seems to be because the best density is the smallest absolute value so negating this means the best density would be the maximum enabling it to be used as a stopping criterion during optimisation.

The performance for all clusters is calculated by summing each cluster performance weighted by the number of points in each and dividing by the number of examples. For example, three clusters with performances -9.066, 0 and -8.943 containing 2, 1 and 3 points respectively would give a result of -7.493 i.e. (2*-9.066+1*0+3*-8.943)/(2+1+3).

What these values mean is a story for another day.

1 comment:

  1. Nice to be mentioned :)