[R] what is used as height in hclust for ward linkage?
james.foadi at diamond.ac.uk
Fri Dec 2 16:03:58 CET 2011
Dear R community,
I am trying to understand quantitatively how Ward linkage works.
To test it I have devised a simple three-member set:
G = c(0,2,10)
The distances between all pairs are:
d(0,2) = 2
d(0,10) = 10
d(2,10) = 8
The smallest distance corresponds to merging 0 and 2. The corresponding ESS are:
ESS(0,2) = 2*var(c(0,2)) = 4
ESS(0,10) = 2*var(c(0,10)) = 100
ESS(2,10) = 2*var(c(2,10)) = 64
and, indeed, the smallest ESS corresponds to merging 0 and 2. The next element that should be added
to 0 and 2 is obviously 10. This is where I don't understand how the hclust algorithm in R works. We have
> G <- c(0,2,10)
> G.dist <- dist(G)
> G.hc <- hclust(G.dist,method="ward")
> G.hc$merge
     [,1] [,2]
[1,]   -1   -2
[2,]   -3    1
> G.hc$height
[1] 2.00000 11.33333
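As a sanity check, the pairwise figures listed before the hclust call can be recomputed in a few lines. This is only a sketch: the names pair_cols, pair_dist, pair_ess and closest are made up for illustration, and "ESS" here means 2*var() exactly as used in this message, not necessarily what hclust computes internally.

```r
# Sketch: recompute the pairwise distances and the 2*var() values for G.
G <- c(0, 2, 10)

# Each column of pair_cols holds one of the three possible pairs.
pair_cols <- combn(G, 2)

# Absolute difference within each pair (the 1-D Euclidean distance).
pair_dist <- apply(pair_cols, 2, function(p) abs(diff(p)))

# The quantity called ESS in this message: 2 * var() of the pair.
pair_ess <- apply(pair_cols, 2, function(p) 2 * var(p))

# Side-by-side view of the three candidate merges.
print(data.frame(a = pair_cols[1, ], b = pair_cols[2, ],
                 dist = pair_dist, ess = pair_ess))

# Pair with the smallest value, i.e. the first merge.
closest <- pair_cols[, which.min(pair_ess)]
print(closest)
```

Both criteria pick the same first merge here because, for a pair, 2*var() is just the squared distance.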
Now, according to standard definitions, the distance between two clusters Rr and Rs, containing Nr and Ns elements respectively, is:
d(Rs,Rr) = sqrt(2*Nr*Ns/(Nr+Ns))*||<Rs> - <Rr>||
where < > denotes the average (centroid) of a cluster. Applying this to the merge of cluster
c(0,2) with 10, I get:
d(c(0,2),10) = sqrt(2*2*1/(2+1))*|1-10| = 10.39230
This is different from 11.3333 in the R output.
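The formula quoted above can also be evaluated directly, which makes the comparison with G.hc$height easy to repeat. A minimal sketch, assuming one-dimensional data so that the norm reduces to an absolute difference; ward_formula() is a hypothetical helper written for this message, not a function from hclust or any package.

```r
# Sketch of d(Rs,Rr) = sqrt(2*Nr*Ns/(Nr+Ns)) * |<Rs> - <Rr>| for 1-D data.
# ward_formula() is a hypothetical helper, not part of hclust.
ward_formula <- function(r, s) {
  nr <- length(r)  # number of elements in the first cluster
  ns <- length(s)  # number of elements in the second cluster
  sqrt(2 * nr * ns / (nr + ns)) * abs(mean(r) - mean(s))
}

# Two singletons: the prefactor is sqrt(2*1*1/2) = 1, so this is |0 - 2|.
print(ward_formula(0, 2))

# Cluster {0, 2} against the singleton {10}, i.e. the second merge.
print(ward_formula(c(0, 2), 10))
```

For two singletons the formula returns the plain distance, which agrees with the first height of 2. It is only the second merge where the printed value and the hclust height disagree.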
Does anyone know exactly what quantity hclust reports as the height for Ward linkage?
Thanks in advance for any help!
J