[R] Question on Cluster Package, agnes() function

Martin Maechler maechler at stat.math.ethz.ch
Sun May 4 00:28:16 CEST 2014


>>>>> Anna F...
>>>>>     on Thu, 1 May 2014 22:09:28 +0000 writes:

    > Hi Martin,
    > I am a statistician at National Jewish Health in Colorado, and I have been working on clustering a dataset using Ward's minimum variance. When plotting the dendrogram, the y-axis is labeled as 'height'. Can you explain to me (or point me in the right direction) on how this distance between merging clusters is calculated for the Ward method? I have found the calculation that SAS uses, and I want to check if it is the same in your method.

    > Here is a summary of the code I am using:
    > Agnes(x,method="ward",diss=TRUE)

Well, as R is case sensitive, it must be

     agnes(x,method="ward",diss=TRUE)


Interestingly, the new version of R, R 3.1.0, now has two
different versions of Ward in hclust(), "ward.D" and "ward.D2":

 --> http://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html

where it is stated that the previous method = "ward" (now "ward.D")
did not implement Ward's criterion unless it was called on squared
dissimilarities, whereas "ward.D2" does. agnes(*, method="ward")
did and still does; it corresponds to hclust(*, "ward.D2").

*The* reference for all basic routines in the 'cluster'  package is

     Kaufman, L. and Rousseeuw, P.J. (1990).  _Finding
     Groups in Data: An Introduction to Cluster Analysis_.  
     Wiley, New York.

Alternatively, the source code of R and all packages is open.
For the cluster package, you can either get cluster_*.tar.gz
from CRAN, or browse the (subversion) development version at
http://svn.r-project.org/

Specifically, the C code that does the computations for agnes() is

  https://svn.r-project.org/R-packages/trunk/cluster/src/twins.c

and there,

    case 4: /*     4: ward's method */
        ta = (double) kwan[la];
        tb = (double) kwan[lb];
        tq = (double) kwan[lq];
        fa = (ta + tq) / (ta + tb + tq);
        fb = (tb + tq) / (ta + tb + tq);
        fc = -tq / (ta + tb + tq);
        int nab = ind_2(la, lb);
        dys[naq] = sqrt(fa * dys[naq] * dys[naq] +
                        fb * dys[nbq] * dys[nbq] +
                        fc * dys[nab] * dys[nab]);
        break;

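contains the distance calculation for ward.  Here kwan[] holds the
cluster sizes, la and lb are the two clusters being merged, lq is
any other cluster, and dys[] holds the dissimilarities.  The update
is the Lance-Williams recurrence for Ward's method: it is applied
to the squared dissimilarities and the square root is taken
afterwards, so the result stays on the scale of the input
dissimilarities.  The 'height' in the dendrogram is the
dissimilarity between the two clusters merged at each step.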

...
[ in private communication with Anna, she agreed that I reply
  publicly to R-help such that others can chime in and all will be
  searchable for people with a similar question. MM ]

Best regards,
Martin Maechler


