[R] Massive clustering job?

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Sat Dec 18 00:09:13 CET 2004


On Wed, 15 Dec 2004, Wiener, Matthew wrote:

>It sounds like "clara" in package cluster might help.

Cheers, this looks just the ticket. How should I choose k though?
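
One heuristic that might work here is to run clara() over a range of k and
keep the value with the largest average silhouette width. The sketch below
is only an illustration: the matrix x is simulated stand-in data, and the
range 2:10 and the samples/sampsize settings are arbitrary assumptions, not
recommendations.

```r
library(cluster)  # provides clara()

set.seed(1)
# Hypothetical stand-in for the real data: 40,000 rows x 20 count columns
x <- matrix(rpois(40000 * 20, lambda = 5), ncol = 20)

ks <- 2:10  # candidate values of k (arbitrary range for illustration)

# For each k, fit clara() and record the average silhouette width,
# which clara() computes on its best sub-sample
avg_width <- sapply(ks, function(k) {
  fit <- clara(x, k, samples = 20, sampsize = 200)
  fit$silinfo$avg.width
})

# Candidate k with the largest average silhouette width
best_k <- ks[which.max(avg_width)]
```

The silhouette width comes from a sub-sample, so it is worth repeating this
with a few different seeds before trusting any single value of k.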

Dan.


>
>Regards,
>
>Matt Wiener
>
>-----Original Message-----
>From: r-help-bounces at stat.math.ethz.ch
>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dan Bolser
>Sent: Wednesday, December 15, 2004 6:37 AM
>To: R mailing list
>Subject: [R] Massive clustering job?
>
>
>
>Hi, 
>
>I have ~40,000 rows in a database, each of which contains an id column and
>20 additional columns of count data.
>
>I want to cluster the rows based on these count vectors.
>
>There are ~1.6 billion possible 'distances' between pairs of vectors
>(cells in my distance matrix), so I need to do something smart.
>
>Can R somehow handle this?
>
>My first thought was to index the database with something that makes
>nearest neighbour lookup more efficient, and then use single linkage
>clustering. Is this kind of index implemented in R (by default when using
>single linkage)?
>
>Also 'grouping' identical vectors is very easy. I tried making groups more
>fuzzy by using a hashing function over the count vectors, but my hash was
>too crude. Any way to do fuzzy grouping in R which scales well?
>
>For example, removing identical vectors gives me ~30,000 rows (and ~900
>million pairs of distances). As an example of how fast I can group, the
>above query took 0.13 seconds in MySQL (using an index over every element
>in the vector). However, if I tried to calculate a distance between every
>pair of non-identical vectors (let's say I can calculate ~1000 Euclidean
>distances per second) it would take me ~10 days just to calculate the
>distance matrix.
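
The figures above can be checked with a little arithmetic. The sketch below
assumes the distance is symmetric, so each unordered pair only needs to be
computed once; the rate of 1000 distances per second is the figure quoted
above, and the 8 bytes per value is the size of an R double.

```r
n_all    <- 40000   # rows before removing duplicates
n_unique <- 30000   # rows after removing identical vectors

cells_all    <- n_all^2              # 1.6e9 cells in the full square matrix
cells_unique <- n_unique^2           # 9e8 cells after removing duplicates
pairs_unique <- choose(n_unique, 2)  # 449985000 distinct unordered pairs

rate <- 1000                         # distances per second (from the message)
days_cells <- cells_unique / rate / 86400  # about 10.4 days for every cell
days_pairs <- pairs_unique / rate / 86400  # about 5.2 days for distinct pairs

# Memory to hold each distinct pair once as an 8-byte double, in GiB
gib_pairs <- pairs_unique * 8 / 2^30       # about 3.35 GiB
```

So exploiting symmetry roughly halves the time, but the full set of
distances is still too large to be practical, which is the argument for a
sampling-based method.
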
>
>Sorry for all the information. Any suggestions on how to cluster such a
>huge dataset (using R) would be appreciated.
>
>Cheers,
>Dan.
>