[R] Recommending number of clusters

Chua Siang Li siang.li.chua at acceval-intl.com
Fri Jun 20 03:04:40 CEST 2008


   Hello there.  I want to segment customers and am trying both the kmeans
   and HCA methods in R.
   As I don't know how many clusters are appropriate for a given data set,
   for kmeans I tried the code below and got a nice plot of number of
   clusters vs SSE. Looking at the slope between successive cluster numbers
   gives me a guide to how many clusters there should be.
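   library(cluster)  # provides daisy()
   # x is my raw data; Gower dissimilarities converted to a full matrix
   dis <- as.matrix(daisy(x, metric = "gower", stand = TRUE))
   # column 1: number of clusters, column 2: mean within-cluster SS,
   # columns 3-12: cluster sizes (at most 10 clusters)
   result <- matrix(NA, 9, 12)
   for (i in 1:9) {
     j <- i + 1  # number of clusters, 2 to 10
     modelkmeans <- kmeans(dis, j)
     result[i, 1] <- j
     result[i, 2] <- mean(modelkmeans$withinss)
     for (k in 1:j)
       result[i, k + 2] <- modelkmeans$size[k]
   }
   # plot column 2 against column 1
   plot(result[, 1:2], main = "Internal Index of K-means Clustering", sub = "",
        xlab = "Number of clusters",
        ylab = "Sum of Squared Error (SSE)", col = "blue")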
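   To make the "slope between cluster numbers" reading concrete, here is a
   minimal sketch, not the code above. It makes some assumptions of its own:
   the built-in iris measurements stand in for the customer data, kmeans is
   run on the scaled columns rather than on a dissimilarity matrix, and SSE
   is taken as the sum (not the mean) of withinss. The names x_demo, ks,
   sse and slopes are made up for illustration.

```r
# Stand-in data (assumption): four numeric columns of the built-in iris set,
# used only so that the sketch runs on its own.
x_demo <- scale(iris[, 1:4])

set.seed(1)  # kmeans starts from random centres, so fix the seed
ks <- 2:10

# Total within-cluster sum of squares for each number of clusters.
sse <- sapply(ks, function(k) {
  fit <- kmeans(x_demo, centers = k, nstart = 25, iter.max = 50)
  sum(fit$withinss)
})

# Reduction in SSE obtained by adding one more cluster (the "slope").
slopes <- data.frame(k = ks[-1], sse = sse[-1], drop = -diff(sse))
print(slopes)

plot(ks, sse, type = "b", col = "blue",
     xlab = "Number of clusters", ylab = "Total within-cluster SSE")
```

   A small "drop" value at some k suggests that going beyond k - 1 clusters
   buys little, which is the same judgement as reading the bend in the plot
   by eye.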
   Questions:
   1. Am I doing the right thing with kmeans? I am a novice in statistics.
   2. Do I do the same thing for HCA, but plot number of clusters vs AC
   instead of SSE?
   3. After obtaining the two results, one from kmeans and one from HCA, is
   there a way to compare which set of results is 'better'?
   4. Are there any other methods for recommending the number of clusters?
   Many thanks.
   siangli

