[R] Massive clustering job?
dmb at mrc-dunn.cam.ac.uk
Wed Dec 15 12:37:00 CET 2004
I have ~40,000 rows in a database, each of which contains an id column and
20 additional columns of count data.
I want to cluster the rows based on these count vectors.
There are ~1.6 billion possible 'distances' between pairs of vectors
(cells in my distance matrix), so I need to do something smart.
Can R somehow handle this?
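
To put a rough number on the problem, here is a small sketch of the arithmetic. It assumes the distances would be held as 8-byte doubles and that only the distinct pairs are stored, which is how R's dist() keeps its result; the variable names are only for illustration.

```r
n <- 40000                      # number of rows (count vectors)

cells <- n^2                    # cells in a full square distance matrix
pairs <- n * (n - 1) / 2        # distinct unordered pairs, as stored by dist()

bytes_per_double <- 8           # assumption: distances held as 8-byte doubles
gib <- pairs * bytes_per_double / 1024^3

cat(sprintf("cells in full matrix: %.0f\n", cells))
cat(sprintf("distinct pairs:       %.0f\n", pairs))
cat(sprintf("memory for dist():    %.1f GiB\n", gib))
```

Under those assumptions the distinct pairs alone need about 6 GiB, so the full set of distances is unlikely to fit in memory even when only half the matrix is stored.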
My first thought was to index the database with something that makes
nearest neighbour lookup more efficient, and then use single linkage
clustering. Is this kind of index implemented in R (by default when using
Also 'grouping' identical vectors is very easy. I tried making groups more
fuzzy by using a hashing function over the count vectors, but my hash was
too crude. Any way to do fuzzy grouping in R which scales well?
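
As a minimal sketch of what exact and fuzzy grouping could look like in R: the data frame `counts`, its column names c1..c20 and the simulated Poisson counts are all made up for this example, and the log2 binning is just one assumed hashing scheme, not the hash I actually tried.

```r
set.seed(1)

# Hypothetical stand-in for the database table: an id column plus 20 count
# columns (the names c1..c20 are made up for this example).
n <- 1000
counts <- data.frame(id = seq_len(n),
                     matrix(rpois(n * 20, lambda = 3), nrow = n,
                            dimnames = list(NULL, paste0("c", 1:20))))

vec_cols <- setdiff(names(counts), "id")

# Exact grouping: one key per distinct count vector.
exact_key <- do.call(paste, c(counts[vec_cols], sep = "_"))
exact_groups <- split(counts$id, exact_key)

# Fuzzy grouping (an assumed scheme): bin each count on a log2 scale so that
# similar counts share a bin, then key on the vector of bins.
binned <- floor(log2(as.matrix(counts[vec_cols]) + 1))
fuzzy_key <- apply(binned, 1, paste, collapse = "_")
fuzzy_groups <- split(counts$id, fuzzy_key)

cat("rows:", n, " exact groups:", length(exact_groups),
    " fuzzy groups:", length(fuzzy_groups), "\n")
```

Both steps are a single pass over the rows, so they scale linearly. The weakness of any binning key like this is that two vectors that differ by one count across a bin boundary land in different groups, which is probably the same kind of crudeness I ran into.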
For example, removing identical vectors gives me ~30,000 rows (and ~900
million pairs of distances). As an example of how fast I can group, the
above query took 0.13 seconds in MySQL (using an index over every element
in the vector). However, if I tried to calculate a distance between every
pair of non-identical vectors (let's say I can calculate ~1000 Euclidean
distances per second) it would take me ~10 days just to calculate the
Sorry for all the information. Any suggestions on how to cluster such a
huge dataset (using R) would be appreciated.