[R] cluster/distance large matrix (fwd)

Thomas Lumley tlumley at u.washington.edu
Thu Feb 11 16:13:00 CET 2010

On Thu, 11 Feb 2010, Christian Hennig wrote:

>It is well known that hierarchical methods are problematic with too large 
>dissimilarity matrices; even if you resolve the memory problem, the number of 
>operations required is enormous.

There is at least one exception to this. Single-linkage hierarchical clustering with a convex distance such as Euclidean distance is feasible for quite large data sets using algorithms for the Euclidean minimum spanning tree. For tens to hundreds of thousands of points (flow cytometry data) the algorithm in the nnclust package is competitive in speed with model-based clustering (on a 32-bit system).  It's slower than pam(), but it is deterministic.

This doesn't apply to the original question, of course.
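
To illustrate the minimum spanning tree connection mentioned above, here is a minimal sketch in base R. It is not the nnclust algorithm, and the function name prim_mst_weights is made up for this example. It runs plain Prim's algorithm on the coordinates, which still takes O(n^2) time but needs only O(n) working memory because the full dissimilarity matrix is never stored. The sorted edge lengths of the Euclidean minimum spanning tree are the single-linkage merge heights, so on a small data set they can be checked against hclust().

```r
# Prim's algorithm for the Euclidean minimum spanning tree.
# x is an n x p numeric matrix of coordinates. Only one vector of n
# distances is held at a time, never the n x n dissimilarity matrix.
# (prim_mst_weights is a hypothetical helper, not part of any package.)
prim_mst_weights <- function(x) {
  n <- nrow(x)
  tx <- t(x)                       # p x n, so each column is one point
  in_tree <- logical(n)
  in_tree[1] <- TRUE
  # best[j]: squared distance from point j to its nearest tree vertex
  best <- colSums((tx - x[1, ])^2)
  best[1] <- Inf
  w <- numeric(n - 1)
  for (k in seq_len(n - 1)) {
    j <- which.min(best)           # closest point not yet in the tree
    w[k] <- sqrt(best[j])          # length of the edge that attaches it
    in_tree[j] <- TRUE
    d <- colSums((tx - x[j, ])^2)  # squared distances to the new vertex
    best <- pmin(best, d)
    best[in_tree] <- Inf           # never pick a tree vertex again
  }
  w
}

# Small check against the standard O(n^2)-memory route.
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
mst_w <- prim_mst_weights(x)
hc <- hclust(dist(x), method = "single")

# The two sets of heights should agree up to floating-point error.
all.equal(sort(mst_w), hc$height)
```

For really large data the quadratic running time of this sketch becomes the bottleneck, which is where specialised Euclidean minimum spanning tree algorithms come in.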


