[R] cluster analysis for 80000 observations

Martin Maechler maechler at stat.math.ethz.ch
Fri Jan 27 09:31:07 CET 2006


>>>>> "Markus" == Markus Preisetanz <Markus.Preisetanz at clientvela.com>
>>>>>     on Thu, 26 Jan 2006 20:48:29 +0100 writes:

    Markus> Dear R Specialists,
    Markus> when trying to cluster a data.frame with about 80,000 rows and 25 columns, I get the above error message. I tried hclust (using dist), agnes (passing the data.frame directly) and pam (passing the data.frame directly). What I actually do not want to do is generate a random sample from the data.

Currently, all of the above-mentioned cluster methods work with
full distance / dissimilarity objects, even if only internally,
i.e., they store all d_{i,j} for  1 <= i < j <= n, that is
n(n-1)/2 values, each of them in double precision, i.e., 8 bytes.
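For n = 80'000 that is easy to quantify {a quick back-of-the-envelope
check in R}:

    n <- 80000
    n * (n - 1) / 2              ## 3,199,960,000 dissimilarities
    n * (n - 1) / 2 * 8 / 2^30   ## ~23.8 GiB just to store them

i.e., far more than fits in 2 GB of RAM.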

So: no chance with the above functions for n = 80'000.

 Markus> The machine I run R on is a Windows 2000 Server (Pentium 4) with 2 GB of RAM.

If you ran a machine with a 64-bit OS and a 64-bit build of R
{a typical case today: Linux on AMD Opteron}, you could go
quite a bit higher than on your Windoze box
{I vaguely remember I could do  'n = a few thousand' on our
 dual Opteron with 16 GBytes}, but 80'000 is definitely too
large.

OTOH, there is clara() in the cluster package, which has been
designed for exactly such situations,
	 CLARA := [C]lustering [LAR]ge [A]pplications.
It is similar in spirit to pam() and *does* cluster all 80'000
observations, but it does so by taking subsamples to construct
the medoids.
(And you can ask it to take many medium-sized subsamples,
 instead of just the 5 small ones it takes by default.)
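For illustration, a minimal sketch of such a call {'x' stands for your
80'000 x 25 data.frame; k = 4 and the samples / sampsize values are
made-up numbers you would tune for your data}:

    library(cluster)
    cl <- clara(x, k = 4,
                samples = 50,     ## many subsamples instead of the default 5
                sampsize = 1000)  ## medium-sized instead of small ones
    table(cl$clustering)          ## assignments for all 80'000 observations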

Martin Maechler, ETH Zurich
maintainer of the "cluster" package.
