[R] kmeans error (bug?)

Mon Nov 10 18:28:22 CET 2003

Prof Brian Ripley wrote:
> 
> This is not a bug.  It just means that the algorithm sometimes finds an
> empty cluster, and as you asked for 34 clusters and it had 33 or less it
> stops.
> 
> What to do in this situation is currently under discussion, but the advice
> given is good: try another set of initial centres.

I am running kmeans in a loop over a range of possible cluster numbers.
The error terminates the loop. Is there a mechanism by which I can
'trap' the error, so that I can rerun kmeans with another set of initial
centers and let the loop run to completion? Something like the
try {} catch () mechanism of C++, for example. A flag for kmeans that
made it return, say, a NULL value rather than an error would also
help in this type of application.
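
R's tryCatch() (or the older try()) can play the role of try/catch
here. What follows is only a minimal sketch: the data are made up to stand in
for cavint.p.r, and kmeans_retry and max_tries are hypothetical names, not
part of R.

```r
# Retry kmeans with fresh random initial centres whenever it signals an error.
# kmeans_retry and max_tries are illustrative names, not an existing R API.
kmeans_retry <- function(x, k, max_tries = 10) {
  for (i in seq_len(max_tries)) {
    # Turn any error (e.g. "empty cluster") into a NULL result
    fit <- tryCatch(kmeans(x, centers = k), error = function(e) NULL)
    if (!is.null(fit)) return(fit)
  }
  NULL  # every attempt failed
}

set.seed(1)
x <- matrix(runif(200 * 3), ncol = 3)  # made-up data, not cavint.p.r
ks <- 2:6
# The loop now runs to completion; a failed k simply yields NULL
fits <- lapply(ks, function(k) kmeans_retry(x, k))
```

Each call draws new random centres, so a retry is a genuinely different
start. Returning NULL after max_tries keeps the loop from spinning forever
on a cluster number that never succeeds.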

In fact, I wonder if anyone can point me to research, or better still R
functions/packages/recipes, that help in choosing the best number of
clusters for the data. What I have tried so far is to run a manova on
the clustering result from kmeans, plot the approximate F statistic
and/or the p-value, and look for cluster numbers where a sharp increase
in F or -log(p-value) occurs. What I would like to do, but don't know how,
is to formally compare successive clustering models. I know you can
compare models using the R function anova, but anova does not seem to
work with mlm models?
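
The manova scan described above can be sketched roughly as follows. This
is an illustration on made-up data, not the exact code I ran; the names x, ks
and approx_F are placeholders, and the nstart argument is assumed to be
available in the installed version of kmeans.

```r
set.seed(1)
x <- matrix(rnorm(300 * 4), ncol = 4)  # made-up data, not cavint.p.r
ks <- 2:8

# For each k: cluster, then use the cluster labels as a factor in a manova
approx_F <- sapply(ks, function(k) {
  cl <- factor(kmeans(x, centers = k, nstart = 5)$cluster)
  fit <- manova(x ~ cl)
  # summary.manova reports Pillai's trace by default; take its approximate F
  summary(fit)$stats["cl", "approx F"]
})

print(data.frame(k = ks, approx_F = approx_F))
```

Plotting approx_F against ks (for example with plot(ks, approx_F, type = "b"))
gives the curve to inspect for a sharp increase. Note that the F statistic is
computed on the same data that produced the clusters, so it is only a
descriptive guide, not a valid test.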

> 
> Please do read the description of a bug in the R FAQ, and do not misuse
> the term to mean `something I do not understand'.

This wasn't really a declaration that this behavior is a bug; rather, it
was a question of whether it is (hence the question mark). I guess what
I found somewhat confusing is that if kmeans selects data points
at random as the initial cluster centers then, at least initially, none
of these clusters would start out empty. It wasn't immediately clear how
further refinement could result in clusters becoming empty.
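
A small hand-built example suggests how this can happen. The sketch below
does two assignment steps of the plain "assign to nearest centre, then
recompute means" iteration on six made-up one-dimensional points. It is an
assumption that this resembles what kmeans does internally; assign_nearest
is a hypothetical helper, not an R function.

```r
x <- c(-1, 0, 10, 11, 11.5, 21)   # made-up data
centres <- c(-1, 0, 21)           # three of the data points as initial centres

# Index of the nearest centre for each point
assign_nearest <- function(x, centres) {
  apply(abs(outer(x, centres, "-")), 1, which.min)
}

cl1 <- assign_nearest(x, centres)             # no cluster is empty here
centres2 <- as.numeric(tapply(x, cl1, mean))  # centres move to cluster means
cl2 <- assign_nearest(x, centres2)            # reassign to the moved centres

print(table(factor(cl1, levels = 1:3)))
print(table(factor(cl2, levels = 1:3)))
```

After the first step the clusters are {-1}, {0, 10} and {11, 11.5, 21}, with
means -1, 5 and 14.5. The second centre has moved to 5, which is farther from
0 than -1 is and farther from 10 than 14.5 is, so both of its points leave
and cluster 2 ends up empty.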

Thanks for the feedback.

> 
> On Mon, 10 Nov 2003, Murad Nayal wrote:
> 
> > I have been getting the following intermittent error from kmeans:
> >
> > >str(cavint.p.r)
> >  num [1:1967, 1:13] 0.691 0.123 0.388 0.268 0.485 ...
> >  - attr(*, "dimnames")=List of 2
> >   ..$ : chr [1:1967] "6" "49" "87" "102" ...
> >   ..$ : chr [1:13] "HYD" "NEG" "POS" "OXY" ...
> > > set.seed(34)
> > > kmeans(cavint.p.r,centers=34)
> > Error: empty cluster: try a better set of initial centers
> >
> > the seed being equal to the number of centers in this case is just a
> > coincidence. I've encountered the same error with or without setting the
> > seed at different numbers of clusters.
> >
> > there is nothing particularly unusual about cavint.p.r (no NAs, NULLs),
> > except maybe for the fact that the rows sum to 1.
> >
> > > sum(is.na(cavint.p.r))
> > [1] 0
> > > sum(is.nan(cavint.p.r))
> > [1] 0
> > >
> >
> > I thought kmeans should select initial centers from the data if not
> > given explicitly! any idea what might be going wrong?
> 
> And what makes you think it did not?
> 
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-- 
Murad Nayal M.D. Ph.D.
Department of Biochemistry and Molecular Biophysics
College of Physicians and Surgeons of Columbia University
630 West 168th Street. New York, NY 10032
Tel: 212-305-6884	Fax: 212-305-6926