[R] Recommending number of clusters
Chua Siang Li
siang.li.chua at acceval-intl.com
Fri Jun 20 03:04:40 CEST 2008
Hello there. I want to segment customers and am trying both the kmeans and
HCA (hierarchical cluster analysis) methods in R.
As I don't know how many clusters suit a given set of data, for kmeans I
tried the code below and got a nice plot of number of clusters vs SSE.
Looking at the slope between successive cluster numbers gives me a guide to
how many clusters there should be.
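library(cluster)  # provides daisy()
# x is my raw data; Gower dissimilarities between all pairs of customers
dis <- as.matrix(daisy(x, metric = "gower", stand = TRUE))
# one row per k: column 1 = k, column 2 = SSE, columns 3 to 12 = cluster sizes
result <- matrix(NA, 9, 12)
for (i in 1:9) {
  j <- i + 1  # number of clusters, 2 to 10
  # note: kmeans() treats each row of dis as one observation's feature vector
  modelkmeans <- kmeans(dis, j)
  result[i, 1] <- j
  result[i, 2] <- sum(modelkmeans$withinss)  # total within-cluster sum of squares
  result[i, 3:(j + 2)] <- modelkmeans$size
}
# plot only k (column 1) against SSE (column 2)
plot(result[, 1], result[, 2], type = "b",
     main = "Internal Index of K-means Clustering", sub = "",
     xlab = "Number of clusters",
     ylab = "Sum of Squared Error (SSE)", col = "blue")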
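To read the slope between cluster numbers off as numbers rather than by eye, something like the sketch below might work. It is only an illustration: it runs kmeans directly on the built-in iris measurements as stand-in data (not my customer data or the Gower matrix), and the names ks, sse and drop are just ones I made up here.

```r
set.seed(1)                     # kmeans uses random starts
x_demo <- scale(iris[, 1:4])    # stand-in numeric data (assumption, not my data)
ks <- 2:10                      # candidate numbers of clusters

# total within-cluster sum of squares (SSE) for each k
sse <- sapply(ks, function(k) {
  sum(kmeans(x_demo, centers = k, nstart = 10)$withinss)
})

# decrease in SSE when going from k to k + 1, absolute and relative
drop <- -diff(sse)
slopes <- data.frame(from = ks[-length(ks)],
                     to = ks[-1],
                     drop = drop,
                     rel_drop = drop / sse[-length(sse)])
print(slopes)
```

A point where rel_drop becomes small and stays small would be the "elbow" I am looking for in the plot.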
Questions:
1. Am I doing the right thing with kmeans? I am a novice in statistics.
2. Do I do the same thing for HCA, but plot number of clusters vs AC
(agglomerative coefficient) instead of SSE?
3. After obtaining the two results, one from kmeans and one from HCA, is
there a way to compare which set of results is 'better'?
4. Are there any other methods for recommending the number of clusters?
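For questions 2 and 3, what I have in mind is roughly the sketch below. It is not a tested recipe. It again uses iris as stand-in data, uses agnes() with Ward linkage as the HCA (my choice for the example only), and scores both methods by average silhouette width on the same Gower dissimilarities. The helper avg_sil is a name I made up.

```r
library(cluster)   # daisy(), agnes(), silhouette()

set.seed(1)
x_demo <- iris[, 1:4]                  # stand-in data (assumption, not my data)
d <- daisy(x_demo, metric = "gower")   # Gower dissimilarities

hc <- agnes(d, method = "ward")        # hierarchical clustering
print(hc$ac)   # agglomerative coefficient: one value for the whole tree, not one per k

# average silhouette width of a partition, measured on the dissimilarities d
avg_sil <- function(cl) mean(silhouette(cl, d)[, "sil_width"])

ks <- 2:8
# kmeans on the rows of the dissimilarity matrix, as in my code above
sil_km <- sapply(ks, function(k) {
  avg_sil(kmeans(as.matrix(d), centers = k, nstart = 10)$cluster)
})
# cut the agnes tree into k groups
sil_hc <- sapply(ks, function(k) avg_sil(cutree(as.hclust(hc), k = k)))

matplot(ks, cbind(sil_km, sil_hc), type = "b", pch = 1:2, lty = 1, col = 1:2,
        xlab = "Number of clusters", ylab = "Average silhouette width")
legend("topright", legend = c("kmeans", "agnes (Ward)"),
       pch = 1:2, lty = 1, col = 1:2)
```

Since both curves are on the same scale, a higher average silhouette width at a given k would suggest the better-separated partition, and the k with the highest value would be one candidate for the number of clusters.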
Many thanks.
siangli