[R] kmeans function

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Mar 27 07:45:03 CET 2014


On 26/03/2014 20:01, Tomassini, Letizia wrote:
> I would like to understand why the fastclus procedure in SAS is affected by the initial order of the data. With the same dataset sorted in a different way, I get different cluster arrangements, which I find really disturbing. R seems to find a stable solution when I use nstart=100, but I do not know how R does this or how to replicate it in SAS. All I know so far is that proc fastclus also uses k-means.
> Regarding R, for example: does R have a way of always choosing the same starting seeds? Does it reorder the dataset according to some internal sort before running kmeans?
> I am interested in finding the clusters that achieve the best global minimum and extracting the seeds from them. I need those seeds for subsequent solutions with other numbers of clusters (for example, deciding on a lower number of clusters and using specific seeds). Overall I am better at using SAS, and I am trying to learn this piece of clustering design from R in order to implement it in SAS.
>
>
> Please let me know if you can help

We (unlike SAS) provide you with source code, which is the definitive 
documentation.  Please read it: it answers all your questions.  (Even 
those who contributed to the implementation of kmeans would need to do 
so to refresh their memories.)
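
As a minimal sketch of what reading the source shows (the toy data and 
object names below are illustrative assumptions, not taken from the 
question): when centers is a single number, kmeans() takes a random 
sample of distinct rows as the starting centres, and with nstart > 1 it 
repeats this and keeps the fit with the smallest total within-cluster 
sum of squares.  No sorting of the data is involved, so with enough 
restarts the row order should not matter.

```r
# Hypothetical toy data: three well-separated groups of 20 points in 2-D.
set.seed(42)  # fixes the random starts, so the run is reproducible
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# The same rows in a different order.
perm <- sample(nrow(x))
x_shuffled <- x[perm, ]

# centers = 3 means "pick 3 rows at random as starting centres";
# nstart = 100 repeats that 100 times and keeps the best fit.
fit_a <- kmeans(x, centers = 3, nstart = 100)
fit_b <- kmeans(x_shuffled, centers = 3, nstart = 100)

# Objective values for the two row orders, side by side.
print(c(fit_a$tot.withinss, fit_b$tot.withinss))

# Cross-tabulate the two partitions on matching rows; cluster labels are
# arbitrary, so agreement shows up as one non-zero cell per row.
print(table(fit_a$cluster[perm], fit_b$cluster))

# Final centres of the best fit; a matrix like this can be supplied as
# 'centers' in a later call to start from specific seeds.
print(fit_a$centers)

# Typing the function name prints its source.
print(stats::kmeans)
```

The final centres in fit_a$centers are one candidate for the "seeds" 
asked about; whether SAS accepts them in that form is not something this 
sketch addresses.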

As for why a SAS algorithm works the way it does: given the fees someone 
is paying SAS on your behalf, they should be willing to explain.

>
> Letizia
>
>
>
> ________________________________________
> From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] on behalf of Ranjan Maitra [maitra.mbox.ignored at inbox.com]
> Sent: Wednesday, 26 March 2014 12:48
> To: r-help at stat.math.ethz.ch
> Subject: Re: [R] kmeans function
>
> On Wed, 26 Mar 2014 18:35:34 +0000 "Tomassini, Letizia"
> <tomassini at vetmed.wsu.edu> wrote:
>
>>
>> Hello
>> I need to ask some questions about the k-means clustering function. Mainly, I would like to know why kmeans, given a large enough nstart, always finds the same clustering arrangement, even when the input dataset is sorted in different ways or I remove a few observations. I cannot seem to recreate that using SAS.
>
> Do you understand what kmeans does? Why would you expect otherwise?
> Besides, why does the function have to match SAS's output? (Do you
> know how the initialization is done in SAS?) In any
> case, should it not provide the correct answer (the best global
> minimum, if possible)?
>
> Ranjan


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



