[R] Cluster analysis, defining center seeds or number of clusters
Christian Hennig
chrish at stats.ucl.ac.uk
Thu Jun 11 18:41:21 CEST 2009
Dear Alex,
actually, fixing the number of clusters in kmeans and then ending up with a
smaller number because of empty clusters is not a standard method of
estimating the number of clusters. It may happen (as apparently in some of
your examples), but it is generally rather unusual. In most cases, kmeans,
as well as clara, pam and other clustering methods, only give you the
number of clusters you ask for. Even with some reasonable separation
between clusters, kmeans cannot generally be expected to come up with empty
clusters if the number is initially chosen too high or too many
initial centers are specified.
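
A minimal sketch of this behaviour, assuming synthetic two-group data (made
up for illustration, not Alex's measurements) and the default kmeans
algorithm with random starts:

```r
set.seed(1)
# Hypothetical data: two well-separated groups of 100 points each
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2))

# Ask for more clusters than the data visibly contain
fit <- kmeans(x, centers = 5, nstart = 20)

# One size per requested cluster; the groups are split rather than left empty
print(fit$size)
```

The sizes printed should show how the two real groups get carved into five
pieces, so the result says nothing about the true number of groups.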
The help page for pam.object in library cluster shows you a method to
estimate the optimal number of clusters based on pam.
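
That approach picks the number of clusters with the largest average
silhouette width. A rough sketch along those lines, assuming synthetic data
and an arbitrary upper limit k_max (both are illustrative choices, not part
of the help page):

```r
library(cluster)

set.seed(1)
# Hypothetical data: two well-separated groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

k_max <- 6               # assumed upper limit for the search
asw <- numeric(k_max)    # asw[1] stays 0: silhouettes need at least 2 clusters
for (k in 2:k_max) {
  asw[k] <- pam(x, k)$silinfo$avg.width
}

k_best <- which.max(asw) # number of clusters with the largest average width
print(round(asw, 3))
print(k_best)
```

Note that this can never suggest a single cluster, because the silhouette
width is not defined for k = 1.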
However, this problem strongly depends on what cluster concept you have in
mind and what you want to use your clusters for. There are alternative
indexes that could be optimised to find the best number of clusters. Some
of them are implemented in the function cluster.stats in package fpc.
I strongly advise reading some literature about this to understand the
problem better; the help page of cluster.stats gives a few references.
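
As an illustration only, the following sketch compares two of the indexes
returned by cluster.stats (average silhouette width and the
Calinski-Harabasz index) over a range of kmeans solutions. The data and the
range 2:5 are assumptions made for the example, and package fpc must be
installed:

```r
library(fpc)

set.seed(1)
# Hypothetical data: two well-separated groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))
d <- dist(x)

res <- data.frame(k = 2:5, avg.silwidth = NA_real_, ch = NA_real_)
for (i in seq_len(nrow(res))) {
  cl <- kmeans(x, centers = res$k[i], nstart = 20)$cluster
  st <- cluster.stats(d, cl)
  res$avg.silwidth[i] <- st$avg.silwidth  # average silhouette width
  res$ch[i] <- st$ch                      # Calinski-Harabasz index
}
print(res)
```

Both indexes are "larger is better", but they formalise different cluster
concepts and need not agree on the same k.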
The BIC gives you an estimate of the number of clusters together with
Gaussian mixtures; see package mclust.
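
A short sketch of how this might look, assuming synthetic data and a search
over 1 to 6 mixture components (both chosen for illustration); package
mclust must be installed:

```r
library(mclust)

set.seed(1)
# Hypothetical data: two well-separated groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

# Fit Gaussian mixtures with 1 to 6 components; Mclust keeps the model
# with the best BIC (mclust uses a scale on which larger BIC is better)
fit <- Mclust(x, G = 1:6)

print(fit$G)                      # selected number of mixture components
print(table(fit$classification))  # cluster sizes under the selected model
```

Unlike the silhouette-based search, this can also return a single component.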
If you can specify things like maximum within-cluster distances, you may
get something from using cutree together with a hierarchical clustering
method in hclust, for example complete linkage.
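
With complete linkage the merge height equals the largest pairwise distance
inside the merged cluster, so cutting the tree at height h bounds the
within-cluster distances by h. A sketch under assumed data and an assumed
threshold h_max:

```r
set.seed(1)
# Hypothetical data: two well-separated groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

hc <- hclust(dist(x), method = "complete")

h_max <- 2                    # assumed maximum within-cluster distance
cl <- cutree(hc, h = h_max)   # cut by height instead of by number of clusters

n_clusters <- length(unique(cl))
print(n_clusters)
print(table(cl))
```

Here the number of clusters is a consequence of h_max rather than an input.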
dbscan and fixmahal in package fpc are further alternatives, requiring
one or two tuning constants to come up with an automatically determined
number of clusters.
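
For dbscan, the two tuning constants are the neighbourhood radius eps and
the minimum number of points MinPts. A minimal sketch, where the data and
the values eps = 0.5 and MinPts = 5 are assumptions that would need tuning
for real measurements; package fpc must be installed:

```r
library(fpc)

set.seed(1)
# Hypothetical data: two well-separated groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

# fpc::dbscan is called explicitly, since other packages also export a
# function named dbscan
db <- fpc::dbscan(x, eps = 0.5, MinPts = 5)

# Cluster label 0 marks points treated as noise
print(table(db$cluster))
n_found <- max(db$cluster)  # number of clusters found, excluding noise
print(n_found)
```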
Best regards,
Christian
On Thu, 11 Jun 2009, amvds at xs4all.nl wrote:
> I use kmeans to classify spectral events in high and low 1/3 octave bands:
>
> #Do cluster analysis
> CyclA<-data.frame(LlowA,LhghA)
> CntrA<-matrix(c(0.9,0.8,0.8,0.75,0.65,0.65), nrow = 3, ncol=2, byrow=TRUE)
> ClstA<-kmeans(CyclA,centers=CntrA,nstart=50,algorithm="MacQueen")
>
> This works well when the actual data shows 1,2 or 3 groups that are not
> "too close" in a cross plot. The MacQueen algorithm will give one or more
> empty groups which is what I want.
>
> However, there are cases when the groups are closer together, less compact
> or diffuse which leads to the situation where visually only 2 groups are
> apparent but the algorithm returns 3 splitting one group in two.
>
> I looked at the package 'cluster' specifically at clara (cannot use pam as
> I have 10000 observations). But clara always returns as many groups as you
> ask for.
>
> Is there a way to help find a seed for the initial cluster centers?
> Equivalently, is there a way to find a priori the number of groups?
>
> I know this is not an easy problem. I have looked at principal components
> (princomp, prcomp) because there is a connection with cluster analysis. It
> is not obvious to me how to program that connection though.
>
> http://en.wikipedia.org/wiki/Principal_Component_Analysis
> http://ranger.uta.edu/~chqding/papers/Zha-Kmeans.pdf
> http://ranger.uta.edu/~chqding/papers/KmeansPCA1.pdf
>
> Thanks in advance,
> Alex van der Spek
*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche