[R] Kmeans centers

Fri Mar 30 09:53:53 CEST 2007

On Fri, 2007-03-30 at 09:07 +0200, Sergio Della Franca wrote:
> My simple problem is that when I run kmeans it gives me different
> results, because if centers is a number, a random set of (distinct)
> rows in x is chosen as the initial centres.

You can stop this and make the result reproducible by setting the seed
for the random number generator before calling kmeans(). That way the
same (pseudo)random set of rows gets selected each time:

dat <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
set.seed(1234)
km <- kmeans(dat, 2)
set.seed(1234)
km2 <- kmeans(dat, 2)
all.equal(km, km2) ## TRUE

But ask yourself whether this is helpful. Are the solutions similar each
time you run the function without setting the seed? If the runs give
very different results then it is likely that you are finding local
minima rather than an optimal solution - a common problem with iterative
algorithms using random starts.

One solution to this /is/ to use several random starts and see if you
get similar results. Some samples may switch clusters, but if the bulk
of the samples are assigned to the same cluster (i.e. grouped together,
not necessarily in cluster "1", as the cluster number is arbitrary) then
you can be happy with the result. That some samples switch clusters may
just indicate that there isn't a clearly defined clustering of all your
samples - some are intermediate between clusters.
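
As a rough sketch of that check (the simulated data, the number of runs
and the object names are all made up for illustration), you could run
kmeans() a few times without a seed and cross-tabulate the memberships.
The nstart argument of kmeans() does something similar internally and
keeps the best of several random starts:

```r
## Illustrative data only: two well-separated groups in two dimensions
set.seed(42)
dat <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

## Five independent random starts
runs <- lapply(1:5, function(i) kmeans(dat, centers = 2))

## Cross-tabulate the memberships from the first two runs. Cluster
## numbers are arbitrary, so look for one large count per row rather
## than for agreement on the diagonal.
tab <- table(run1 = runs[[1]]$cluster, run2 = runs[[2]]$cluster)
print(tab)

## Total within-cluster sum of squares for each run; large differences
## between runs suggest that some runs stopped in local minima
wss <- sapply(runs, function(km) sum(km$withinss))
print(wss)

## Let kmeans() try 25 random starts and keep the best one
km_best <- kmeans(dat, centers = 2, nstart = 25)
```

If the cross-tabulation is close to a permutation of a diagonal table,
the runs agree apart from the labelling.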

Another is to use a hierarchical cluster analysis (via hclust()). Cut it
at the number of clusters you want and use the centers (sic) of those
clusters as the starting points for kmeans. This way the hclust()
results get you close to a good solution, which kmeans then updates as
it is not constrained by having a hierarchical structure.
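
A minimal sketch of that idea, in the spirit of the MASS example but
not copied from it (the simulated data, the choice of average linkage
and the object names are my assumptions):

```r
## Illustrative data only: two groups in two dimensions
set.seed(1)
dat <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))

## Hierarchical clustering on Euclidean distances, cut into 2 groups
hc  <- hclust(dist(dat), method = "average")
grp <- cutree(hc, k = 2)

## Group means on each variable: a k x p matrix of starting centres
init <- apply(dat, 2, function(col) tapply(col, grp, mean))

## Start kmeans from those centres; no random selection is involved,
## so repeated calls give the same answer
km <- kmeans(dat, centers = init)

## Compare the hierarchical groups with the kmeans solution
print(table(hclust = grp, kmeans = km$cluster))
```

Because the starting centres are supplied as a matrix, there is no need
to set the seed for this call.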

There is an example of this in Modern Applied Statistics with S (2002 -
Venables and Ripley, Springer), but if you don't have this book, you can
see the MASS scripts for Chapter 11 of the book. The MASS scripts should
have been provided with your copy of R, in
RINSTALL/library/MASS/scripts/ where RINSTALL is where your version of R
is installed. You want ch11.R in that directory. Look at section 11.2
Cluster Analysis in that file.
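
If you would rather not hunt for RINSTALL by hand, something like the
following should locate the file from within R. This assumes MASS is
installed in your library and ships its scripts directory;
system.file() returns an empty string if the file is not found:

```r
## Ask R where the installed copy of the MASS chapter 11 script lives
path <- system.file("scripts", "ch11.R", package = "MASS")

if (nzchar(path)) {
  cat("Found:", path, "\n")
  ## file.show(path)  # uncomment to page through the script
} else {
  cat("ch11.R not found; check your MASS installation\n")
}
```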

>  
> To me the problem is simple.
>  
> The question I ask you is whether it is possible for centers to be
> something other than a number,
> i.e. instead of indicating a number of centers, would it be possible to
> indicate different character labels to identify the clusters I want to
> obtain?

No. And this is why, despite how clear and simple the problem is to you,
you need to show us an example of your data! Surely, if you already have
information that exactly identifies the clusters you want to find, you
do not need a clustering algorithm to find them for you?
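
If all you want is descriptive names on the clusters kmeans() finds,
you can attach them afterwards. This is only a guess at what you are
after; the data, the labels "low" and "high" and the rule for assigning
them (ordering the clusters by their centre on the first variable) are
invented for illustration:

```r
## Illustrative data only: two well-separated groups
set.seed(1)
dat <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2))

## centers must still be a number (or a numeric matrix of centres)
km <- kmeans(dat, centers = 2)

## Order the cluster numbers by their centre on the first variable,
## then map the integer memberships onto character labels
ord  <- order(km$centers[, 1])
labs <- factor(km$cluster, levels = ord, labels = c("low", "high"))

print(table(labs))
```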

G

>    
> Thank you
> 
> 
>  
> 2007/3/29, Gavin Simpson <gavin.simpson at ucl.ac.uk>: 
>         On Thu, 2007-03-29 at 15:02 +0200, Sergio Della Franca wrote:
>         > Dear R-Helpers,
>         >
>         > I read in the R documentation, about kmeans: 
>         >
>         >   centers
>         >
>         > Either the number of clusters or a set of initial (distinct)
>         > cluster centres. *If a number*, a random set of (distinct) rows
>         > in x is chosen as the initial centres.
>         > My question is: could it be possible that the centers are
>         > character and not number?
>         
>         I think you misunderstand - centers is the number of clusters
>         you want to partition your data into. How else would you specify
>         the number of clusters other than by a number? So no, it has to
>         be numeric.
>
>         The alternative use of centers is to provide known starting
>         points for the algorithm, such as from the results of a
>         hierarchical cluster analysis. These are the locations of the
>         cluster centroids, for each cluster, on each of the feature
>         variables.
>
>         Also, argument x to kmeans() is specific about requiring a
>         numeric matrix (or something coercible to one), so characters
>         here are not allowed either.
>
>         But then again, I may not have understood what it is that you
>         are asking, but that is not surprising given that you have not
>         provided an example of what you are trying to do, and how you
>         tried to do it but failed.
>         
>         > and provide commented, minimal, self-contained, reproducible code.
>         
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>         G
>         
> 
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [t] +44 (0)20 7679 0522
ECRC                              [f] +44 (0)20 7679 0565
UCL Department of Geography
Pearson Building                  [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street
London, UK                        [w] http://www.ucl.ac.uk/~ucfagls/
WC1E 6BT                          [w] http://www.freshwaters.org.uk/
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%