[R] finding centroids of clusters created with hclust

Sun May 14 17:08:04 CEST 2006

Dear Gavin,

Gavin Simpson wrote:
> On Wed, 2006-05-10 at 18:59 +0200, Moritz Lennert wrote:
>> Replying to myself for the record:
>>
>> Moritz Lennert wrote:
>>> Hello,
>>>
>>> Can someone point me to documentation or ideas on how to calculate the 
>>> centroids of clusters identified with hclust ?
>>>
>>> I would like to be able to chose the number of clusters (in the style of 
>>> cutree) and then get the centroids of these clusters.
>>>
>>> This seems like a quite obvious task to me, but I haven't been able to 
>>> put my hands on a relevant command.
> 
> Sorry, Moritz, I meant to reply to your original post, but deleted it
> from my emailer accidentally and hadn't had chance to use the archives
> to follow up.
> 
> Anyway, Venables and Ripley's Modern Applied Statistics with S (4th Ed)
> [and earlier editions - it is in my 3rd Edition for example] has an
> example of doing what you want to do on page 318 of the 4th Edition.
> They use the centre's of the hclust results as starting points for a
> k-means, so we only need the preliminary bits of their example:
> 
> library(MASS)
> swiss.x <- as.matrix(swiss)
> h <- hclust(dist(swiss.x), method = "average")
> initial <- tapply(swiss.x, list(rep(cutree(h, 3), ncol(swiss.x)),
>                                 col(swiss.x)),
>                   mean)
> dimnames(initial) <- list(NULL, dimnames(swiss.x)[[2]])
> initial
> 
> Which gives almost the same output as your function:
> 
> fun <-  function (data, clust) {
>   nvars=length(data[1,])
>   ntypes=max(clust)
>   centroids<-matrix(0,ncol=nvars,nrow=ntypes)
>   for(i in 1:ntypes) {
>     c<-rep(0,nvars)
>     n<-0
>     for(j in names(clust[clust==i])) {
>       n<-n+1
>       c<-c+data[j,]
>     }
>     centroids[i,]<-c/n
>   }
>   rownames(centroids)<-c(1:ntypes)
>   colnames(centroids)<-colnames(data)
>   centroids
> }
> 
> fun(swiss.x, cutree(h, 3))
> 
> Wrapping the Venables & Ripley version into a function to give the same
> output as your function:
> 
> ##
> ## clust.means - function to find centroids of clusters
> ## based on example by Venables & Ripley, MASS 4thEd, Page 318 [1]
> ##
> ## x            = input data as data.frame or matrix
> ## res.clust    = object of class "hclust"
> ## groups       = number of groups to cut dendrogram into
> ##
> ## References:
> ##
> ## [1] Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics 
> ##     with S. 4th Edition. Springer.
> clust.means <- function(x, res.clust, groups)
>   {
>     if(!is.matrix(x))
>       x <- as.matrix(x)
>     means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)),
>                                   col(x)),
>                     mean)
>     dimnames(means) <- list(NULL, dimnames(x)[[2]])
>     return(as.data.frame(means))
>   }
> 
> clust.means(swiss, h, 3)

I have a weird error happening here:

when I run the line

means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)), 
col(x)), mean)

directly on the command line, it works. But when I try to run the 
clust.means function, I get:

Error in rep(cutree(res.clust, groups), ncol(x)) :
         could not find function "cutree"

> Your function is faster here:
> 
>> system.time(for(i in 1:10000) fun(swiss.x, cutree(h, 3)))
> [1] 8.917 0.000 9.695 0.000 0.000
>> system.time(for(i in 1:10000) clust.means(swiss, h, 3))
> [1] 31.642  0.008 35.348  0.000  0.000
> 
> But I think the example is instructive about using R. Sometimes
> vectorisation can make a big time saving over a loop - here it doesn't.

Yes, thank you very much. I will have to read up a bit more on tapply to 
be able to understand it completely, but it is certainly a good learning 
example.

Thanks,

Moritz