[R] Cluster analysis, defining center seeds or number of clusters

Thu Jun 11 18:41:21 CEST 2009

Dear Alex,

actually, fixing the number of clusters in kmeans and then ending up with a 
smaller number because of empty clusters is not a standard method of 
estimating the number of clusters. It may happen (as apparently in some of 
your examples), but it is generally rather unusual. In most cases, kmeans, 
as well as clara, pam and other clustering methods, only give you the 
number of clusters you ask for. Even with some reasonable separation 
between clusters, kmeans cannot generally be expected to come up with empty 
clusters if the number is initially chosen too high or too many 
initial centers are specified.

The help page for pam.object in package cluster shows you a method to 
estimate the optimal number of clusters based on pam.
However, this problem strongly depends on what cluster concept you have in 
mind and what you want to use your clusters for. There are alternative 
indexes that could be optimised to find the best number of clusters. Some 
of them are implemented in the function cluster.stats in package fpc.
I strongly advise reading some literature about this to understand the 
problem better; the help page of cluster.stats gives a few references.
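
As a rough sketch of how such a comparison could look (not a recommendation 
of one particular index): the code below runs pam for several numbers of 
clusters and records the average silhouette width, as on the pam.object help 
page, together with the Calinski-Harabasz index from cluster.stats. The data 
x, the range ks and the names idx, k.asw and k.ch are made up for 
illustration; your own data frame would take the place of x.

```r
library(cluster)  # pam()
library(fpc)      # cluster.stats()

set.seed(1)
# Made-up stand-in for the two-band data: three compact groups of 100 points
ctr <- rbind(c(0.90, 0.80), c(0.80, 0.75), c(0.65, 0.65))
x <- ctr[rep(1:3, each = 100), ] + matrix(rnorm(600, sd = 0.01), ncol = 2)
d <- dist(x)

ks <- 2:6  # candidate numbers of clusters (an arbitrary choice)
idx <- t(sapply(ks, function(k) {
  fit <- pam(x, k)
  st  <- cluster.stats(d, fit$clustering)
  c(k = k,
    asw = fit$silinfo$avg.width,  # average silhouette width
    ch  = st$ch)                  # Calinski-Harabasz index
}))
print(idx)

# Number of clusters preferred by each index (larger is better for both)
k.asw <- ks[which.max(idx[, "asw"])]
k.ch  <- ks[which.max(idx[, "ch"])]
```

With 10000 observations the full distance matrix becomes large; clara objects 
also carry a silinfo component, though computed only on the best subsample, 
so that may be a workable substitute for pam in the loop.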

The BIC gives you an estimate of the number of clusters when used together 
with Gaussian mixture models, see package mclust.
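
A minimal sketch of this, assuming the clusters can reasonably be modelled as 
Gaussian: Mclust fits mixtures over a range of component numbers and keeps 
the one with the best BIC. The data x and the range G = 1:6 are made up for 
illustration.

```r
library(mclust)

set.seed(1)
# Made-up stand-in for the two-band data: three compact groups of 100 points
ctr <- rbind(c(0.90, 0.80), c(0.80, 0.75), c(0.65, 0.65))
x <- ctr[rep(1:3, each = 100), ] + matrix(rnorm(600, sd = 0.01), ncol = 2)

# Fit Gaussian mixtures with 1 to 6 components over the default
# covariance models; mclust defines the BIC so that larger is better
fit <- Mclust(x, G = 1:6)

fit$G                     # estimated number of mixture components
fit$modelName             # covariance model selected along with G
table(fit$classification) # cluster sizes under the selected model
```

Note that the answer is the number of Gaussian components, which need not 
coincide with the number of groups you see visually if the groups are not 
Gaussian in shape.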

If you can specify things like maximum within-cluster distances, you may 
get something from using cutree together with a hierarchical clustering 
method in hclust, for example complete linkage.
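
For illustration, a small sketch with made-up data x and an arbitrarily 
chosen threshold h: with complete linkage the merge height is the largest 
distance between points of the merged clusters, so cutting the tree at 
height h gives clusters in which no two points are further apart than h.

```r
set.seed(1)
# Made-up stand-in for the two-band data: three compact groups of 100 points
ctr <- rbind(c(0.90, 0.80), c(0.80, 0.75), c(0.65, 0.65))
x <- ctr[rep(1:3, each = 100), ] + matrix(rnorm(600, sd = 0.01), ncol = 2)

d  <- dist(x)
hc <- hclust(d, method = "complete")

h  <- 0.1                 # assumed maximum within-cluster distance
cl <- cutree(hc, h = h)   # the number of clusters follows from h
table(cl)

# Check: the largest within-cluster distance does not exceed h
dm <- as.matrix(d)
max.within <- max(sapply(unique(cl), function(g) max(dm[cl == g, cl == g])))
max.within <= h
```

hclust needs the full distance matrix, so for 10000 observations this is 
heavy on memory; running it on a subsample may be more practical.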

dbscan and fixmahal in package fpc are further alternatives, requiring
one or two tuning constants to come up with an automatically determined
number of clusters.
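
A short sketch for dbscan, with made-up data x and tuning constants (eps and 
MinPts) picked by hand for this toy example; suitable values for real data 
have to be found by experiment. The number of clusters is not specified but 
comes out of the density criterion.

```r
library(fpc)

set.seed(1)
# Made-up stand-in for the two-band data: three compact groups of 100 points
ctr <- rbind(c(0.90, 0.80), c(0.80, 0.75), c(0.65, 0.65))
x <- ctr[rep(1:3, each = 100), ] + matrix(rnorm(600, sd = 0.01), ncol = 2)

# eps: neighbourhood radius in data units; MinPts: minimum number of
# points within eps for a point to count as a core point
db <- fpc::dbscan(x, eps = 0.02, MinPts = 5)

table(db$cluster)  # cluster 0 collects the points treated as noise
n.clusters <- length(setdiff(unique(db$cluster), 0))
n.clusters
```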

Best regards,
Christian

On Thu, 11 Jun 2009, amvds at xs4all.nl wrote:

> I use kmeans to classify spectral events in high and low 1/3 octave bands:
>
> #Do cluster analysis
> CyclA<-data.frame(LlowA,LhghA)
> CntrA<-matrix(c(0.9,0.8,0.8,0.75,0.65,0.65), nrow = 3, ncol=2, byrow=TRUE)
> ClstA<-kmeans(CyclA,centers=CntrA,nstart=50,algorithm="MacQueen")
>
> This works well when the actual data shows 1, 2 or 3 groups that are not
> "too close" in a cross plot. The MacQueen algorithm will give one or more
> empty groups which is what I want.
>
> However, there are cases when the groups are closer together, less compact
> or diffuse which leads to the situation where visually only 2 groups are
> apparent but the algorithm returns 3 splitting one group in two.
>
> I looked at the package 'cluster' specifically at clara (cannot use pam as
> I have 10000 observations). But clara always returns as many groups as you
> ask for.
>
> Is there a way to help find a seed for the initial cluster centers?
> Equivalently, is there a way to find a priori the number of groups?
>
> I know this is not an easy problem. I have looked at principal components
> (princomp, prcomp) because there is a connection with cluster analysis. It
> is not obvious to me how to program that connection though.
>
> http://en.wikipedia.org/wiki/Principal_Component_Analysis
> http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf
> http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf
>
> Thanks in advance,
> Alex van der Spek

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche