[R] cluster analysis for 80000 observations
Martin Maechler
maechler at stat.math.ethz.ch
Fri Jan 27 09:31:07 CET 2006
>>>>> "Markus" == Markus Preisetanz <Markus.Preisetanz at clientvela.com>
>>>>> on Thu, 26 Jan 2006 20:48:29 +0100 writes:
Markus> Dear R Specialists,
Markus> When trying to cluster a data.frame with about 80,000 rows and 25 columns I get the above error message. I tried hclust (using dist), agnes (entering the data.frame directly) and pam (entering the data.frame directly). What I actually do not want to do is generate a random sample from the data.
Currently, all the above-mentioned cluster methods work with
full distance / dissimilarity objects, even if only internally,
i.e., they store all d_{i,j} for 1 <= i < j <= n, that is
n(n-1)/2 values, each of them in double precision, i.e. 8 bytes.
For n = 80'000 that is about 3.2e9 values, i.e. roughly 24
GBytes for the dissimilarities alone.
So: no chance with the above functions and n = 80'000.
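For the record, the back-of-the-envelope computation in R
(just the arithmetic, no package calls needed):

    n  <- 80000
    nd <- n * (n - 1) / 2   # number of pairwise dissimilarities, ~ 3.2e9
    nd * 8 / 2^30           # bytes in double precision -> ~ 23.8 GiB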
Markus> The machine I run R on is a Windows 2000 Server (Pentium 4) with 2 GB of RAM.
If you were running a machine with a 64-bit OS and a 64-bit
build of R {typical case today: Linux on AMD Opteron}, you
could go quite a bit higher than on your Windoze box
{I vaguely remember being able to do 'n = a few thousand' on
our dual-Opteron with 16 GBytes}, but 80'000 is definitely too
large.
OTOH, there is clara() in the cluster package, which has been
designed for exactly such situations:
CLARA := [C]lustering [LAR]ge [A]pplications.
It is similar in spirit to pam(); it *does* cluster all
80'000 observations, but does so by drawing subsamples to
construct the medoids.
(And you can ask it to take many medium-sized subsamples,
instead of just the 5 small ones it takes by default; see the
sketch below.)
Martin Maechler, ETH Zurich
maintainer of the "cluster" package.