[Rd] hclust() and agnes() method="average" divergence (PR#3648)
maechler at stat.math.ethz.ch
Thu Aug 14 10:25:27 MEST 2003
>>>>> "MikG" == m grum <m.grum at cgiar.org>
>>>>> on Mon, 4 Aug 2003 08:51:30 +0200 (MET DST) writes:
MikG> Anyone have a clue why hclust() and agnes() produce
MikG> different results in the example below when both use
MikG> method="average"?? I'm not able to reproduce the
MikG> problem with other datasets.
MikG> ereck <- read.table("Ereck.txt",header=TRUE,sep="\t")
MikG> emol <- subset(ereck,select=c(11:18,20:32))
MikG> library(cluster)
MikG> library(mva)
MikG> daisemol <- daisy(emol,type=list(asymm=c(1:21)))
The reason is that most of the distances/dissimilarities are the
same: there are only 20 different values in the 1326 distances.
> sort(table(daisemol), decreasing=TRUE)
starts as
>> 0.666666666666667               0.5               0.8 0.285714285714286
>>               387               284               251                94
i.e. the distance 2/3 appears 387 times, 1/2 does 284 times, etc.
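To check a dissimilarity object for ties yourself, something like the following sketch may help. It does not use your Ereck.txt (which I cannot reproduce here); instead it assumes a made-up 52 x 21 matrix of random 0/1 values, and uses dist(, method="binary") as a stand-in for the asymmetric binary dissimilarity. The names `x`, `d` and `tab` are purely illustrative.

```r
set.seed(1)
# Synthetic stand-in for the poster's data (an assumption, not the real file):
# 52 observations on 21 binary variables
x <- matrix(rbinom(52 * 21, size = 1, prob = 0.5), nrow = 52)

# Asymmetric binary (Jaccard-type) distances between all pairs of rows
d <- dist(x, method = "binary")

length(d)                      # number of pairwise distances, choose(52, 2)

# Frequency of each distinct distance value, most frequent first;
# rounding guards against floating-point noise splitting equal values
tab <- sort(table(round(as.vector(d), 10)), decreasing = TRUE)
length(tab)                    # number of distinct values
head(tab)                      # the most heavily tied distances
```

If length(tab) is much smaller than length(d), the clustering will have to break many ties.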
With so many ties in the distances, choosing the next pair of
observations / clusters for "merging" often means choosing among
many equally good candidates; hence the arbitrariness, and the
difference between the two algorithms.
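One way to see whether the two functions actually disagree on a given dissimilarity is to convert the agnes() result with as.hclust() and compare the trees. This is only a sketch on synthetic tied data (again an assumed 52 x 21 random 0/1 matrix, not your dataset), and whether the two trees differ will depend on the data and on how each implementation happens to break ties.

```r
library(cluster)   # agnes(); hclust() is in stats in current R

set.seed(1)
# Synthetic binary data with many tied distances (illustrative only)
x <- matrix(rbinom(52 * 21, size = 1, prob = 0.5), nrow = 52)
d <- dist(x, method = "binary")

hc <- hclust(d, method = "average")
ag <- as.hclust(agnes(d, method = "average"))

# Do the sorted merge heights agree?
isTRUE(all.equal(sort(hc$height), sort(ag$height)))

# How similar are the two trees overall? (1 means identical cophenetic distances)
cor(cophenetic(hc), cophenetic(ag))

# Cross-tabulate the groupings obtained by cutting both trees into 4 clusters
# (k = 4 is an arbitrary choice for illustration)
table(hclust = cutree(hc, k = 4), agnes = cutree(ag, k = 4))
```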
For your situation, you might be able to use some continuous
variable along with the factors and the many binary ones such
that the distances won't have ties.
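A rough sketch of that idea, under the assumption that a suitable continuous variable exists: below, `cont` is a hypothetical variable filled with random normal values, appended to 21 synthetic binary columns, and daisy() is called as in your code so that the binary columns stay asymmetric. The helper `n_distinct` is likewise made up for this illustration.

```r
library(cluster)   # daisy()

set.seed(1)
# 21 synthetic binary variables on 52 observations (illustrative only)
bin <- as.data.frame(matrix(rbinom(52 * 21, size = 1, prob = 0.5), nrow = 52))

# Dissimilarities from the binary variables alone
d_bin <- daisy(bin, type = list(asymm = 1:21))

# Same data plus one hypothetical continuous variable
mixed <- cbind(bin, cont = rnorm(52))
d_mix <- daisy(mixed, type = list(asymm = 1:21))

# Count distinct dissimilarity values (rounded to absorb floating-point noise)
n_distinct <- function(d) length(unique(round(as.vector(d), 10)))

c(binary_only = n_distinct(d_bin), with_continuous = n_distinct(d_mix))
```

With the continuous variable included, far more of the pairwise dissimilarities should be distinct, so the merge order is much less dependent on tie-breaking.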
NO bug! I.e., you should have posted to R-help, not R-bugs
(you did have a good question!).
Regards,
Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><