[R] Cluster analysis, defining center seeds or number of clusters

amvds at xs4all.nl amvds at xs4all.nl
Thu Jun 11 17:14:50 CEST 2009


I use kmeans to classify spectral events in high and low 1/3 octave bands:

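# Do cluster analysis
CyclA <- data.frame(LlowA, LhghA)
# Initial centers, one row per cluster: (low band, high band)
CntrA <- matrix(c(0.9, 0.8, 0.8, 0.75, 0.65, 0.65), nrow = 3, ncol = 2, byrow = TRUE)
# Note: per ?kmeans, nstart is only used when centers is a number
ClstA <- kmeans(CyclA, centers = CntrA, nstart = 50, algorithm = "MacQueen")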

This works well when the actual data show 1, 2 or 3 groups that are not
"too close" in a cross plot. When fewer than 3 groups are present, the
MacQueen algorithm returns one or more empty groups, which is what I want.

However, there are cases where the groups are closer together, less compact
or diffuse. Visually only 2 groups are apparent, but the algorithm returns
3, splitting one group in two.

I looked at the package 'cluster', specifically at clara (I cannot use pam
as I have 10000 observations). But clara always returns as many groups as
you ask for.
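
Since clara needs the number of groups up front, one possible workaround
is to run it for several candidate values of k and compare the average
silhouette width it reports for its best sample. The following is only a
rough sketch under stated assumptions: the data frame `demo` is synthetic
stand-in data (not the real CyclA), and the candidate range 2:4 is
arbitrary. Silhouette width is undefined for a single cluster, so this
cannot detect the 1-group case.

```r
library(cluster)

# Synthetic stand-in for CyclA: two well-separated groups (assumed data)
set.seed(1)
demo <- rbind(
  cbind(rnorm(500, 0.90, 0.03), rnorm(500, 0.80, 0.03)),
  cbind(rnorm(500, 0.65, 0.03), rnorm(500, 0.65, 0.03))
)

# Average silhouette width of clara's best sample for each candidate k
ks <- 2:4
avg_sil <- sapply(ks, function(k) {
  clara(demo, k, samples = 20)$silinfo$avg.width
})

# Candidate k with the largest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = avg_sil))
```

The medoids of the clara fit for `best_k` (component `medoids`) could then
serve as the `centers` matrix for kmeans.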

Is there a way to help find a seed for the initial cluster centers?
Equivalently, is there a way to find a priori the number of groups?

I know this is not an easy problem. I have looked at principal components
(princomp, prcomp) because there is a connection with cluster analysis. It
is not obvious to me how to program that connection though.
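
As a rough illustration of one way that connection might be programmed
for the 2-group case only: split the observations by the sign of their
score on the first principal component, then use the two group means as
seeds for kmeans. This is a sketch of that idea, not a method confirmed
by the references linked in this message; `demo` is synthetic stand-in
data, and the names `pc1`, `grp` and `seeds` are made up for this example.

```r
# Synthetic stand-in for CyclA: two well-separated groups (assumed data)
set.seed(1)
demo <- rbind(
  cbind(rnorm(500, 0.90, 0.03), rnorm(500, 0.80, 0.03)),
  cbind(rnorm(500, 0.65, 0.03), rnorm(500, 0.65, 0.03))
)

# Scores on the first principal component (prcomp centers the data)
pc1 <- prcomp(demo)$x[, 1]

# Tentative 2-way split by the sign of the PC1 score
grp <- ifelse(pc1 > 0, 1, 2)

# Group means become the initial centers, one row per cluster
seeds <- rbind(colMeans(demo[grp == 1, ]), colMeans(demo[grp == 2, ]))

fit <- kmeans(demo, centers = seeds, algorithm = "MacQueen")
print(fit$size)
```

Extending this beyond 2 groups would need more components and is not
covered by the sketch.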

http://en.wikipedia.org/wiki/Principal_Component_Analysis
http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf
http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf

Thanks in advance,
Alex van der Spek



