Basically algorithm tries to group data points into several separate groups (clusters). Every group (cluster) has its own centroid (point in the centre of the cluster space). And clustering is optimal when distance of points in each cluster from its centroid is as small as possible.

Centroids are newly created data points. There is a variant of this algorithm – so called “k-medoids clustering” which uses only already existing data points.

“K-means” is NON-hierarchical clustering – clusters are separate.

Implementation in R: command “kmeans”

Very simple example with vector – we examine runtime of aggregations on database and want to find some patterns :

> aggtime<-c(1.1,1.9,1.5,0.9,1.8,5.5,6.1,7.8,25,26.8,28,55,48,64)

Simple plot ( plot(aggtime) ) will not show it in the best way.

Therefore we can plot in “in line” using:

plot(aggtime, rep.int(0,length(aggtime)))

Now we can see groups (clusteres) in better way. But this is only very simple scholastioc example!

We can try k-means algorithm with different values to see if it helps us to find patterns.

Let’s try 3 clusters:

> fit3<-kmeans(aggtime,3) > fit3 K-means clustering with 3 clusters of sizes 3, 3, 8 Cluster means: [,1] 1 26.60000 2 55.66667 3 3.32500 Clustering vector: [1] 3 3 3 3 3 3 3 3 1 1 1 2 2 2 Within cluster sum of squares by cluster: [1] 4.5600 128.6667 50.9750 (between_SS / total_SS = 97.1 %) Available components: [1] "cluster" "centers" "totss" "withinss" [5] "tot.withinss" "betweenss" "size" "iter" [9] "ifault"

And now 4 clusters:

> fit4<-kmeans(aggtime,4) > fit4 K-means clustering with 4 clusters of sizes 3, 3, 3, 5 Cluster means: [,1] 1 6.466667 2 55.666667 3 26.600000 4 1.440000 Clustering vector: [1] 4 4 4 4 4 1 1 1 3 3 3 2 2 2 Within cluster sum of squares by cluster: [1] 2.846667 128.666667 4.560000 0.752000 (between_SS / total_SS = 97.8 %) Available components: [1] "cluster" "centers" "totss" "withinss" [5] "tot.withinss" "betweenss" "size" "iter" [9] "ifault"

Now 5 clusters:

> fit5<-kmeans(aggtime,5) > fit5 K-means clustering with 5 clusters of sizes 3, 1, 3, 5, 2 Cluster means: [,1] 1 55.666667 2 25.000000 3 6.466667 4 1.440000 5 27.400000 Clustering vector: [1] 4 4 4 4 4 3 3 3 2 5 5 1 1 1 Within cluster sum of squares by cluster: [1] 128.666667 0.000000 2.846667 0.752000 0.720000 (between_SS / total_SS = 97.9 %) Available components: [1] "cluster" "centers" "totss" "withinss" [5] "tot.withinss" "betweenss" "size" "iter" [9] "ifault"

And now 10 clusters:

> fit10<-kmeans(aggtime,10) > fit10 K-means clustering with 10 clusters of sizes 2, 2, 1, 2, 1, 2, 1, 1, 1, 1 Cluster means: [,1] 1 27.40 2 1.00 3 1.50 4 1.85 5 64.00 6 5.80 7 25.00 8 7.80 9 48.00 10 55.00 Clustering vector: [1] 2 4 3 2 4 6 6 8 7 1 1 10 9 5 Within cluster sum of squares by cluster: [1] 0.720 0.020 0.000 0.005 0.000 0.180 0.000 0.000 0.000 [10] 0.000 (between_SS / total_SS = 100.0 %) Available components: [1] "cluster" "centers" "totss" "withinss" [5] "tot.withinss" "betweenss" "size" "iter" [9] "ifault"

As we can see with 10 clusters we have theoretically 100% fit but in reality this is a nonsense – 10 clunsters on 14 data points simply does not make sense. Also 5 clusters is too high number for this example. Therefore our best fit is 4 clusters.