Christian Hennig
fm3a004 at math.uni-hamburg.de
Wed Dec 15 13:16:35 CET 2004
Dear Dan,
I would think about transforming your columns (square root, log?) so
that methods which operate on n*p matrices and assume roughly
elliptical within-cluster distributions can be applied: kmeans or
clara, or, after dimension reduction, EMclust or fixmahal.
Maybe you can even do that on untransformed data (take a look at the
variable-wise distributions or 2-d scatterplots).
You do not need a distance matrix then.
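
A minimal sketch of the idea, not a tested recipe for your data. It
assumes the 20 count columns are already in a numeric matrix; the
matrix x below is simulated as a stand-in, and the number of clusters
k, the Poisson rates and the tuning arguments are arbitrary choices
made only for illustration.

```r
library(cluster)  # provides clara()

# Simulated stand-in for the real table: n rows, p count columns.
# Each row draws its counts from one of three Poisson rates, so the
# data contain three groups (an assumption for this sketch only).
set.seed(1)
n <- 40000
p <- 20
lambda <- rep(c(2, 10, 30), length.out = n)   # one rate per row
x <- matrix(rpois(n * p, lambda), nrow = n, ncol = p)

# Square-root transform to reduce the skewness of the count columns.
xs <- sqrt(x)

# kmeans works on the n*p matrix directly; no n*n distance matrix
# is ever computed or stored.
k <- 3  # hypothetical number of clusters
km <- kmeans(xs, centers = k, nstart = 5, iter.max = 50)
print(table(km$cluster))

# clara finds medoids on small subsamples and then assigns every
# row to its nearest medoid, so it also avoids the full matrix.
cl <- clara(xs, k = k, samples = 20)
print(table(cl$clustering))
```

Both calls return one cluster label per row (km$cluster and
cl$clustering), which can be written back to the database next to
the id column.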
Christian
On Wed, 15 Dec 2004, Dan Bolser wrote:
>
> Hi,
>
> I have ~40,000 rows in a database, each of which contains an id column and
> 20 additional columns of count data.
>
> I want to cluster the rows based on these count vectors.
>
> There are ~1.6 billion possible 'distances' between pairs of vectors
> (cells in my distance matrix), so I need to do something smart.
>
> Can R somehow handle this?
>
> My first thought was to index the database with something that makes
> nearest neighbour lookup more efficient, and then use single linkage
> clustering. Is this kind of index implemented in R (by default when using
> single linkage)?
>
> Also 'grouping' identical vectors is very easy. I tried making groups more
> fuzzy by using a hashing function over the count vectors, but my hash was
> too crude. Any way to do fuzzy grouping in R which scales well?
>
> For example, removing identical vectors gives me ~30,000 rows (and ~900
> million pairs of distances). As an example of how fast I can group, the
> above query took 0.13 seconds in mysql (using an index over every element
> in the vector). However, if I tried to calculate a distance between every
> pair of non-identical vectors (let's say I can calculate ~1000 Euclidean
> distances per second) it would take me ~10 days just to calculate the
> distance matrix.
>
> Sorry for all the information. Any suggestions on how to cluster such a
> huge dataset (using R) would be appreciated.
>
> Cheers,
> Dan.
>
>
