[R] finding centroids of clusters created with hclust

Gavin Simpson gavin.simpson at ucl.ac.uk
Wed May 10 20:17:35 CEST 2006


On Wed, 2006-05-10 at 18:59 +0200, Moritz Lennert wrote:
> Replying to myself for the record:
> 
> Moritz Lennert wrote:
> > Hello,
> > 
> > Can someone point me to documentation or ideas on how to calculate the 
> > centroids of clusters identified with hclust ?
> > 
> > I would like to be able to choose the number of clusters (in the style of 
> > cutree) and then get the centroids of these clusters.
> > 
> > This seems like a quite obvious task to me, but I haven't been able to 
> > put my hands on a relevant command.

Sorry, Moritz, I meant to reply to your original post, but accidentally
deleted it from my mail client and hadn't had a chance to use the
archives to follow up.

Anyway, Venables and Ripley's Modern Applied Statistics with S (4th Ed)
[and earlier editions - it is in my 3rd Edition, for example] has an
example of doing what you want on page 318 of the 4th Edition.
They use the centres of the hclust clusters as starting points for
k-means, so we only need the preliminary bits of their example:

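library(MASS)  # as in the book; swiss itself is in the 'datasets' package
swiss.x <- as.matrix(swiss)
h <- hclust(dist(swiss.x), method = "average")
## mean of each variable (column) within each of the 3 clusters
initial <- tapply(swiss.x, list(rep(cutree(h, 3), ncol(swiss.x)),
                                col(swiss.x)),
                  mean)
## no row names; variable names as column names
dimnames(initial) <- list(NULL, dimnames(swiss.x)[[2]])
initial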

Which gives the same centroid values as your function (only the row
names differ):

fun <-  function (data, clust) {
  nvars=length(data[1,])
  ntypes=max(clust)
  centroids<-matrix(0,ncol=nvars,nrow=ntypes)
  for(i in 1:ntypes) {
    c<-rep(0,nvars)
    n<-0
    for(j in names(clust[clust==i])) {
      n<-n+1
      c<-c+data[j,]
    }
    centroids[i,]<-c/n
  }
  rownames(centroids)<-c(1:ntypes)
  colnames(centroids)<-colnames(data)
  centroids
}

fun(swiss.x, cutree(h, 3))
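
A quick way to confirm how the two results compare is sketched below.
It is not part of the original exchange: it simply recomputes both
results from scratch and uses all.equal() on the values. The object
names by.tapply, by.fun and same.values are made up for illustration.

```r
## Sketch: compare the tapply() centroids with the loop-based function.
## Object names (by.tapply, by.fun, same.values) are illustrative only.
swiss.x <- as.matrix(swiss)
h  <- hclust(dist(swiss.x), method = "average")
cl <- cutree(h, 3)                       # named vector of cluster labels

## tapply() version, as in the Venables & Ripley example
by.tapply <- tapply(swiss.x, list(rep(cl, ncol(swiss.x)), col(swiss.x)), mean)
dimnames(by.tapply) <- list(NULL, colnames(swiss.x))

## loop version, copied from the function posted in this thread
fun <- function(data, clust) {
  nvars <- length(data[1, ])
  ntypes <- max(clust)
  centroids <- matrix(0, ncol = nvars, nrow = ntypes)
  for (i in 1:ntypes) {
    c <- rep(0, nvars)
    n <- 0
    for (j in names(clust[clust == i])) {
      n <- n + 1
      c <- c + data[j, ]
    }
    centroids[i, ] <- c / n
  }
  rownames(centroids) <- c(1:ntypes)
  colnames(centroids) <- colnames(data)
  centroids
}
by.fun <- fun(swiss.x, cl)

## Compare the numbers only, ignoring dimnames and other attributes
same.values <- isTRUE(all.equal(by.tapply, by.fun, check.attributes = FALSE))
same.values
rownames(by.tapply)   # row names of the tapply() result
rownames(by.fun)      # row names of the loop result
```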

Wrapping the Venables & Ripley version into a function to give the same
output as your function:

##
## clust.means - function to find centroids of clusters
## based on example by Venables & Ripley, MASS 4thEd, Page 318 [1]
##
## x            = input data as data.frame or matrix
## res.clust    = object of class "hclust"
## groups       = number of groups to cut dendrogram into
##
## References:
##
## [1] Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics 
##     with S. 4th Edition. Springer.
clust.means <- function(x, res.clust, groups)
  {
    if(!is.matrix(x))
      x <- as.matrix(x)
    means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)),
                                  col(x)),
                    mean)
    dimnames(means) <- list(NULL, dimnames(x)[[2]])
    return(as.data.frame(means))
  }

clust.means(swiss, h, 3)
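
For completeness, here is a rough sketch of the step mentioned earlier,
where the hclust centroids are used as starting centres for k-means.
It is written from memory of how kmeans() is called rather than copied
from the book, so treat it as an illustration only; the object names
cl and km are my own.

```r
## Sketch: use the hclust centroids as starting centres for k-means.
## Object names (cl, km) are illustrative, not from the original posts.
swiss.x <- as.matrix(swiss)
h  <- hclust(dist(swiss.x), method = "average")
cl <- cutree(h, 3)

## One row of column means per cluster (same tapply() idiom as above)
initial <- tapply(swiss.x, list(rep(cl, ncol(swiss.x)), col(swiss.x)), mean)
dimnames(initial) <- list(NULL, colnames(swiss.x))

## kmeans() accepts a matrix of initial centres in place of a cluster count
km <- kmeans(swiss.x, centers = initial)

km$centers                                # centroids after k-means
table(hclust = cl, kmeans = km$cluster)   # cross-tabulate the two partitions
```

The cross-tabulation shows how many observations k-means moved away
from their hclust group.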

Your function is faster here:

> system.time(for(i in 1:10000) fun(swiss.x, cutree(h, 3)))
[1] 8.917 0.000 9.695 0.000 0.000
> 
> system.time(for(i in 1:10000) clust.means(swiss, h, 3))
[1] 31.642  0.008 35.348  0.000  0.000

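But I think the example is still instructive about using R. Note that
the clust.means timing also includes converting swiss to a matrix and
the result back to a data frame on every call. Vectorisation can
sometimes save a lot of time over an explicit loop - here it doesn't.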
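
Another vectorised route, not from the original posts and not timed
here, would be to sum the rows within each cluster with rowsum() and
divide by the cluster sizes. A minimal sketch, assuming the cluster
labels come from cutree(); the names sums, counts and centroids are
illustrative:

```r
## Sketch: centroids via rowsum(); an alternative idiom, not benchmarked.
swiss.x <- as.matrix(swiss)
h  <- hclust(dist(swiss.x), method = "average")
cl <- cutree(h, 3)

sums   <- rowsum(swiss.x, cl)      # per-cluster column sums, rows sorted by label
counts <- as.vector(table(cl))     # cluster sizes, in the same label order
centroids <- sums / counts         # recycling divides row i by counts[i]

centroids
```

Whether this beats the loop would need checking with system.time() on
your own data.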

HTH

G

> 
> Here's a simple function that does the job for me:
> 
> Variables:
> 
> data: matrix of original (absolute value) data introduced into hclust or 
> HierClust
> clust: result of a 'cutree' call on the results of the hclust or 
> HierClust call
> 
> Value:
> 
> a matrix of relative values of the variables at the centroids of the types
> 
> 
> function (data, clust) {
>    nvars=length(data[1,])
>    ntypes=max(clust)
>    centroids<-matrix(0,ncol=nvars,nrow=ntypes)
>    for(i in 1:ntypes) {
>       c<-rep(0,nvars)
>       n<-0
>       for(j in names(clust[clust==i])) {
>          n<-n+1
>          c<-c+data[j,]
>       }
>       centroids[i,]<-c/n
>    }
>    rownames(centroids)<-c(1:ntypes)
>    colnames(centroids)<-colnames(data)
>    centroids
> }
> 
> Moritz
> 
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
*  Note new Address, Telephone & Fax numbers from 6th April 2006  *
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson
ECRC & ENSIS                  [t] +44 (0)20 7679 0522
UCL Department of Geography   [f] +44 (0)20 7679 0565
Pearson Building              [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street                  [w] http://www.ucl.ac.uk/~ucfagls/cv/
London, UK.                   [w] http://www.ucl.ac.uk/~ucfagls/
WC1E 6BT.
>%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%



