[R] cluster/distance large matrix

Christian Hennig chrish at stats.ucl.ac.uk
Thu Feb 11 14:41:47 CET 2010


Dear Bart,

a strange thing in your question is that the term "Ward's method" 
usually refers to a method based on the k-means criterion, which, in its 
standard form, is defined not on dissimilarities but on 
"objects*variables" data.
So I wonder how and why you want to use Ward's method on a dissimilarity 
matrix in the first place. (I know that the k-means criterion 
can in principle be translated to dissimilarity data; this is probably 
what hclust's method="ward" does if fed with a dissimilarity matrix, though 
I'm not sure. But then it loses its justification.)
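
To illustrate what "translating the k-means criterion to dissimilarity 
data" can mean, here is a minimal sketch on toy data. It relies on the 
standard identity that, for Euclidean distances, the within-cluster sum of 
squares equals the sum of squared pairwise distances within each cluster 
divided by twice the cluster size. The names x, cl, wss_data and wss_diss 
are illustrative assumptions, and this is not a statement about what hclust 
does internally.

```r
# Within-cluster sum of squares (the k-means criterion), computed two ways.
set.seed(1)
x  <- matrix(rnorm(40), ncol = 2)   # toy data: 20 objects, 2 variables
cl <- rep(1:2, each = 10)           # an arbitrary partition into 2 clusters

# (a) From the objects*variables data: squared deviations from cluster means.
wss_data <- sum(sapply(unique(cl), function(k) {
  xk <- x[cl == k, , drop = FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)
}))

# (b) From Euclidean dissimilarities only:
#     for each cluster, sum over all ordered pairs of d_ij^2, divided by 2 * n_k.
d <- as.matrix(dist(x))
wss_diss <- sum(sapply(unique(cl), function(k) {
  idx <- which(cl == k)
  sum(d[idx, idx]^2) / (2 * length(idx))
}))

all.equal(wss_data, wss_diss)
```

Formula (b) can be evaluated for any dissimilarity matrix, but the link to 
cluster means in (a) only holds when the dissimilarities are Euclidean 
distances.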

One thing you could think about is using the function pam in package 
cluster. Chances are that this won't work on 38,000 cases either, but you 
may cluster a subsample of, say, 2,000 cases and assign all further 
objects to the most similar cluster medoid.
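
A rough sketch of this subsample-and-assign idea, on a small toy 
dissimilarity matrix standing in for the real one. The object names (D, 
sub, med, assignment), the subsample size and k = 2 are assumptions made 
for illustration only.

```r
library(cluster)  # provides pam()

set.seed(1)
# Toy stand-in for the full dissimilarity matrix (300 objects, not 38,000).
x <- rbind(matrix(rnorm(300, mean = 0), ncol = 2),
           matrix(rnorm(300, mean = 5), ncol = 2))
D <- as.matrix(dist(x))
n <- nrow(D)

# 1. Cluster a random subsample with pam() on its sub-matrix of dissimilarities.
sub <- sort(sample(n, 60))
fit <- pam(as.dist(D[sub, sub]), k = 2, diss = TRUE)

# 2. Medoids, expressed as row indices of the full matrix.
med <- sub[fit$id.med]

# 3. Assign every object to the cluster of its most similar medoid.
assignment <- apply(D[, med, drop = FALSE], 1, which.min)
table(assignment)
```

Note that step 3 only needs the dissimilarities between each object and the 
k medoids, so in principle the full matrix never has to be held in memory 
at once.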

It is well known that hierarchical methods are problematic with very large 
dissimilarity matrices; even if you resolve the memory problem, the number 
of operations required is enormous.
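
To give a feeling for the memory side, here is a back-of-the-envelope 
calculation, assuming 8 bytes per stored value (double precision) and 
ignoring any temporary copies R makes along the way.

```r
n        <- 38000
n_pairs  <- n * (n - 1) / 2         # entries in a "dist" object (lower triangle)
bytes    <- 8                       # one double-precision value
gib      <- 1024^3

full_gib <- n^2 * bytes / gib       # full n x n matrix
dist_gib <- n_pairs * bytes / gib   # lower triangle only

round(c(full = full_gib, dist = dist_gib), 1)
```

This comes to roughly 10.8 GiB for the full matrix and 5.4 GiB for the 
lower triangle alone, and reading the data via matrix() followed by 
as.dist() needs both objects in memory at the same time.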

Hope this helps,
Christian


On Thu, 11 Feb 2010, Bart Thijs wrote:

>
> Hi all,
>
> I've stumbled upon some memory limitations for the analysis that I want to
> run.
>
> I've a matrix of distances between 38000 objects. These distances were
> calculated outside of R.
> I want to cluster these objects.
>
> For smaller sets (e.g. n=100) this is how I proceed:
> A<-matrix(scan(file, n=100*100),100,100, byrow=TRUE)
> ad<-as.dist(A)
> ahc<-hclust(ad,method="ward",members=NULL)
> ....
>
> However if I try this with the real dataset I end up with memory problems.
> I've the 64bit version of R installed on a machine with 40Gb RAM (Windows
> 2003 64bit version).
>
> I'm thinking about using only the lower triangle of the matrix but I can't
> create a distance object for the clustering from the lower.tri
>
> Can someone help me with a suggestion for which way to go?
>
> Best Regards
> Bart Thijs
> --
> View this message in context: http://n4.nabble.com/cluster-distance-large-matrix-tp1477237p1477237.html
> Sent from the R help mailing list archive at Nabble.com.
>

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


