[R] extracting groups from hclust() for a very large matrix

Milan Bouchet-Valat nalimilan at club.fr
Fri Oct 12 21:05:21 CEST 2012


On Friday, 12 October 2012 at 11:33 -0700, Christopher R. Dolanc
wrote:
> That command gives me the same result. Do you see that R is not listing 
> the plot numbers? Just all the numbers between 1 and 137, 138 and 310, 
> etc. It's like it has reordered the dendrogram, so that everything 
> occurs chronologically.
> 
> Instead, I would expect something like this:
> 
> [1]
> 3, 15, 48, 134, 136, 213, 299, .....
> 
> [2]
> 44, 67, 177, .....
Yeah, but that's a problem with your data or your dist function, not
with hclust() and cutree().

As always, it's good to find a minimal example that reproduces the
problem. Start from the example provided by ?cutree:
hc <- hclust(dist(USArrests))
cutree(hc, k=2)
       Alabama         Alaska        Arizona       Arkansas     California 
             1              1              1              2              1 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             2              2              1              1              2 

      etc.

Here you can see that the cluster numbers do not simply follow the row
order, and my command lists the groups correctly:
 split(rownames(USArrests), cutree(hc, 2))
$`1`
 [1] "Alabama"        "Alaska"         "Arizona"        "California"    
 etc.

$`2`
 [1] "Arkansas"      "Colorado"      "Connecticut"   "Georgia"      
 [5] "Hawaii"        "Idaho"         "Indiana"       "Iowa"         
 etc.  
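
To make the link between the two outputs above explicit, here is a small
sketch on the same USArrests example. It only relies on base R; the
variable names groups and members are mine, chosen for illustration.
cutree() returns one cluster number per row, in the original row order,
and split() then regroups the row names by that number:

```r
hc <- hclust(dist(USArrests))
groups <- cutree(hc, k = 2)

# One entry per row of USArrests, named after the rows, in row order.
identical(names(groups), rownames(USArrests))

# Regroup the row names by cluster number, then count members per cluster.
members <- split(names(groups), groups)
lengths(members)
```

If the row names are not informative, any id vector of the same length
can be passed to split() in place of names(groups).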

So either your data is already ordered, or there is a problem with how
you compute the distances. One guess: you have included the "Plot" column
in the call to vegdist(). I don't know this function, but it seems to work
like dist(), which means the plot id gets treated as one more variable.
That is plain wrong: plots with close ids then look similar. Indeed, if I
drop that column:
VTM.Dist <- vegdist(VTM.Matrix[, -1])   # drop the first ("Plot") column
# note: from R 3.1.0 on, this method is named "ward.D"
VTM.HClust <- hclust(VTM.Dist, method="ward")
VTM.8groups <- cutree(VTM.HClust, 8)
the result is not ordered as before.
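
A minimal sketch of that effect, on made-up data rather than the real
VTM matrix. Everything here is an assumption for illustration: the data
frame plots, its columns, and the use of dist() with complete linkage in
place of vegdist() and Ward. The abundances are deliberately tiny
compared to the plot ids, so that the id dominates the distances:

```r
# Synthetic stand-in for a plots-by-species table (hypothetical names).
set.seed(1)
n_plots <- 60
abund <- matrix(runif(n_plots * 5, min = 0, max = 0.01), nrow = n_plots,
                dimnames = list(NULL, paste0("sp", 1:5)))
plots <- data.frame(Plot = seq_len(n_plots), abund)

# Wrong: the Plot id is used as one more variable and dominates the distances.
hc_with_id <- hclust(dist(plots), method = "complete")
groups_with_id <- cutree(hc_with_id, k = 4)

# Right: drop the id column before computing the distances.
hc_without_id <- hclust(dist(plots[, -1]), method = "complete")
groups_without_id <- cutree(hc_without_id, k = 4)

# FALSE here means the cluster numbers never decrease as the plot id grows,
# i.e. each group is just a consecutive run of plot numbers.
is.unsorted(groups_with_id)

# Group membership by plot id once the id column is left out.
split(plots$Plot, groups_without_id)
```

With the id column included, the groups are consecutive blocks of plot
numbers, which is the symptom described in the question. Real data need
not be this extreme, but the mechanism is the same.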

Lesson: try with simple, standard data when complex data sets don't
work, and compare results.


My two cents



