[R] finding centroids of clusters created with hclust
Moritz Lennert
mlennert at club.worldonline.be
Sun May 14 17:08:04 CEST 2006
Dear Gavin,
Gavin Simpson wrote:
> On Wed, 2006-05-10 at 18:59 +0200, Moritz Lennert wrote:
>> Replying to myself for the record:
>>
>> Moritz Lennert wrote:
>>> Hello,
>>>
>>> Can someone point me to documentation or ideas on how to calculate the
>>> centroids of clusters identified with hclust ?
>>>
>>> I would like to be able to choose the number of clusters (in the style of
>>> cutree) and then get the centroids of these clusters.
>>>
>>> This seems like a quite obvious task to me, but I haven't been able to
>>> put my hands on a relevant command.
>
> Sorry, Moritz, I meant to reply to your original post, but deleted it
> from my emailer accidentally and hadn't had a chance to use the archives
> to follow up.
>
> Anyway, Venables and Ripley's Modern Applied Statistics with S (4th Ed)
> [and earlier editions - it is in my 3rd Edition for example] has an
> example of doing what you want to do on page 318 of the 4th Edition.
> They use the centres of the hclust results as starting points for a
> k-means, so we only need the preliminary bits of their example:
>
> library(MASS)
> swiss.x <- as.matrix(swiss)
> h <- hclust(dist(swiss.x), method = "average")
> initial <- tapply(swiss.x,
>                   list(rep(cutree(h, 3), ncol(swiss.x)), col(swiss.x)),
>                   mean)
> dimnames(initial) <- list(NULL, dimnames(swiss.x)[[2]])
> initial
>
> Which gives almost the same output as your function:
>
> fun <- function (data, clust) {
>   nvars <- length(data[1, ])
>   ntypes <- max(clust)
>   centroids <- matrix(0, ncol = nvars, nrow = ntypes)
>   for (i in 1:ntypes) {
>     ## running sum and count for cluster i
>     c <- rep(0, nvars)
>     n <- 0
>     ## clust must be a named vector, as returned by cutree()
>     for (j in names(clust[clust == i])) {
>       n <- n + 1
>       c <- c + data[j, ]
>     }
>     centroids[i, ] <- c / n
>   }
>   rownames(centroids) <- c(1:ntypes)
>   colnames(centroids) <- colnames(data)
>   centroids
> }
>
> fun(swiss.x, cutree(h, 3))
>
> Wrapping the Venables & Ripley version into a function to give the same
> output as your function:
>
> ##
> ## clust.means - function to find centroids of clusters
> ## based on example by Venables & Ripley, MASS 4thEd, Page 318 [1]
> ##
> ## x = input data as data.frame or matrix
> ## res.clust = object of class "hclust"
> ## groups = number of groups to cut dendrogram into
> ##
> ## References:
> ##
> ## [1] Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics
> ## with S. 4th Edition. Springer.
> clust.means <- function(x, res.clust, groups)
> {
>   if(!is.matrix(x))
>     x <- as.matrix(x)
>   means <- tapply(x,
>                   list(rep(cutree(res.clust, groups), ncol(x)), col(x)),
>                   mean)
>   dimnames(means) <- list(NULL, dimnames(x)[[2]])
>   return(as.data.frame(means))
> }
>
> clust.means(swiss, h, 3)
I am seeing a strange error here. When I run the line

  means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)),
                          col(x)), mean)

directly on the command line, it works. But when I call the
clust.means function, I get:

  Error in rep(cutree(res.clust, groups), ncol(x)) :
          could not find function "cutree"
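
The message alone does not say why cutree is visible at the prompt but not
inside the function. A minimal sketch for narrowing it down, assuming a
current R in which cutree lives in the stats package (on an older
installation it may sit elsewhere). The name clust.means2 is hypothetical;
it is the same function with the call qualified as stats::cutree, so the
lookup no longer depends on the search path:

```r
# Where does R find cutree from the prompt? (assumes a current R)
find("cutree")                    # packages on the search path that provide it
environment(cutree)               # the namespace the function belongs to

# Hypothetical variant of clust.means with an explicitly qualified call.
clust.means2 <- function(x, res.clust, groups) {
  x <- as.matrix(x)
  grp <- stats::cutree(res.clust, k = groups)   # no reliance on the search path
  means <- tapply(x, list(rep(grp, ncol(x)), col(x)), mean)
  dimnames(means) <- list(NULL, colnames(x))
  as.data.frame(means)
}

h <- hclust(dist(as.matrix(swiss)), method = "average")
cm <- clust.means2(swiss, h, 3)
cm
```

If the qualified version runs while the original does not, that would point
at the search path or at the environment the function was defined in, rather
than at the tapply call itself.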
> Your function is faster here:
>
> > system.time(for(i in 1:10000) fun(swiss.x, cutree(h, 3)))
> [1] 8.917 0.000 9.695 0.000 0.000
> > system.time(for(i in 1:10000) clust.means(swiss, h, 3))
> [1] 31.642 0.008 35.348 0.000 0.000
>
> But I think the example is instructive about using R. Sometimes
> vectorisation can make a big time saving over a loop - here it doesn't.
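
For comparison, a third way to get the same centroids, sketched here with
rowsum(). This is not from the thread or from Venables & Ripley, just another
vectorised formulation. It sums the rows of the data matrix within each
cluster and divides by the cluster sizes. Whether it beats the loop on a
given machine is something to check with system.time rather than assume:

```r
x <- as.matrix(swiss)
h <- hclust(dist(x), method = "average")
grp <- cutree(h, k = 3)

# rowsum() adds up the rows of x within each group (rows ordered by group label).
sums <- rowsum(x, grp)

# table(grp) gives the cluster sizes in the same order; the length-3 vector is
# recycled down the columns, so row i is divided by the size of cluster i.
centroids <- sums / as.vector(table(grp))
centroids
```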
Yes, thank you very much. I will have to read up a bit more on tapply to
be able to understand it completely, but it is certainly a good learning
example.
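
For anyone else reading up on tapply, here is a small sketch of what that
call does. The 4 x 2 matrix and the group labels are made up for illustration
and have nothing to do with the swiss data. The idea is that tapply treats
the matrix as one long vector (column by column), so every cell needs two
labels: the cluster of its row and the index of its column.

```r
# Made-up data: 4 observations, 2 variables; rows 1-2 in cluster 1, rows 3-4 in cluster 2.
x <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40),
            ncol = 2, dimnames = list(NULL, c("a", "b")))
grp <- c(1, 1, 2, 2)

# One cluster label per cell: the row labels repeated once per column.
row.group <- rep(grp, ncol(x))    # 1 1 2 2 1 1 2 2
# One column index per cell.
col.index <- col(x)               # 1 1 1 1 2 2 2 2 when read column by column

# Mean of the cells sharing the same (cluster, column) pair.
m <- tapply(x, list(row.group, col.index), mean)
dimnames(m) <- list(NULL, colnames(x))
m
#         a  b
# [1,]  1.5 15
# [2,]  3.5 35
```

Each row of the result is the centroid of one cluster, which is exactly what
the swiss example computes with cutree(h, 3) in place of grp.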
Thanks,
Moritz