[R] cluster/distance large matrix
Christian Hennig
chrish at stats.ucl.ac.uk
Thu Feb 11 14:41:47 CET 2010
Dear Bart,
a strange thing in your question is that the term "Ward's method"
usually refers to a method based on the k-means criterion, which, in its
standard form, is not based on dissimilarities, but on
"objects*variables-data".
So I wonder how and why you want to use Ward's method on a dissimilarity
matrix in the first place. (I know that the "k-means" criterion can in
principle be translated to dissimilarity data; this is probably what
hclust's method="ward" does if fed with a dissimilarity matrix, though
I'm not sure. But then it loses its justification.)
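To illustrate the translation mentioned above: for Euclidean data, the within-cluster sum of squares around the mean equals the sum of squared pairwise distances divided by the cluster size. The following is only a small numerical check on simulated data (x, wss_coords and wss_dist are made-up names); it says nothing about what hclust does internally.

```r
# Check, for one simulated "cluster", that the k-means criterion can be
# written in terms of pairwise Euclidean distances only.
set.seed(1)
x <- matrix(rnorm(50 * 3), nrow = 50)   # 50 objects, 3 variables

# Sum of squared deviations from the cluster mean (objects*variables form)
wss_coords <- sum(scale(x, center = TRUE, scale = FALSE)^2)

# Same quantity from dissimilarities: sum over pairs i<j of d_ij^2, over n
d <- dist(x)
wss_dist <- sum(d^2) / nrow(x)

print(all.equal(wss_coords, wss_dist))
```

The identity relies on the distances being Euclidean; for a general dissimilarity matrix the right-hand side can still be computed, but it no longer corresponds to deviations from a mean.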
One thing you could think about is using the function pam in the package
cluster. Chances are that this won't work on 38,000 cases either, but you
may cluster a subsample of, say, 2,000 cases and assign all further
objects to the cluster of the most similar medoid.
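A minimal sketch of the subsample idea, on simulated data that stands in for the real distances. The sizes, the number of clusters k = 4, and the names D, sub, medoids and assignment are assumptions made for illustration only.

```r
library(cluster)

set.seed(42)
n <- 500                          # stand-in for the 38,000 objects
x <- matrix(rnorm(n * 2), ncol = 2)
D <- as.matrix(dist(x))           # hypothetical full distance matrix

# Cluster a random subsample with pam, using dissimilarities directly
sub <- sort(sample(n, 100))
fit <- pam(as.dist(D[sub, sub]), k = 4, diss = TRUE)

# fit$id.med indexes the subsample; map back to rows of the full data
medoids <- sub[fit$id.med]

# Assign every object to the medoid it is closest to
assignment <- apply(D[, medoids, drop = FALSE], 1, which.min)
print(table(assignment))
```

Note that the assignment step only needs the distances from each object to the k medoids, so with the real data those k columns could be read from the file without holding the full matrix in memory.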
It is well known that hierarchical methods are problematic with too large
dissimilarity matrices; even if you resolve the memory problem, the number
of operations required is enormous.
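For a rough idea of the memory side, here is a back-of-the-envelope calculation assuming 8 bytes per stored distance (double precision). It ignores any temporary copies that scan, matrix, as.dist or hclust may make, so actual usage would be higher.

```r
# Storage needed for n = 38000 objects, at 8 bytes per double
n <- 38000
bytes_full <- n^2 * 8              # full n x n matrix
bytes_dist <- n * (n - 1) / 2 * 8  # lower triangle only, as in a dist object

gib <- function(b) b / 2^30        # bytes to GiB
print(round(c(full = gib(bytes_full), dist = gib(bytes_dist)), 2))
```

Reading the full matrix and then converting it with as.dist means that both objects exist at the same time, at least temporarily.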
Hope this helps,
Christian
On Thu, 11 Feb 2010, Bart Thijs wrote:
>
> Hi all,
>
> I've stumbled upon some memory limitations for the analysis that I want to
> run.
>
> I've a matrix of distances between 38000 objects. These distances were
> calculated outside of R.
> I want to cluster these objects.
>
> For smaller sets (e.g. n=100) this is how I proceed:
> A<-matrix(scan(file, n=100*100),100,100, byrow=TRUE)
> ad<-as.dist(A)
> ahc<-hclust(ad,method="ward",members=NULL)
> ....
>
> However if I try this with the real dataset I end up with memory problems.
> I've the 64bit version of R installed on a machine with 40Gb RAM (Windows
> 2003 64bit version).
>
> I'm thinking about using only the lower triangle of the matrix but I can't
> create a distance object for the clustering from the lower.tri
>
> Can someone help me with a suggestion for which way to go?
>
> Best Regards
> Bart Thijs
*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche