[BioC] ctc package - cluster dendrogram

Jarno Tuimala jtuimala at csc.fi
Wed Oct 17 07:20:49 CEST 2007


Hi!

If you draw a dendrogram in R, the y-axis is the distance between objects. 
In your case, the tree looks roughly like:

4  +--+
|  |  |
3  1  |
|    +-+
2    | |
      2 3

As the branch which connects V2 and V3 is at approx. 2.4, that is the 
distance between these objects (samples). The same applies to the 
distance between V1 and the other two samples (V2 and V3). Those connect 
at approx. 3.9, which under average linkage is the mean of the V1-V2 and 
V1-V3 distances (about 3.5 and 4.4). You can plot the tree using

plot(hc, hang=0)

and this should become more evident.
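
To check the two heights numerically, here is a minimal sketch. It assumes 
that base R's hclust() with method = "average" gives the same merge heights 
as hcluster(..., link = "ave"); the distance values are copied from the 
distance matrix in your message, and the names d and hc are only for 
illustration.

```r
# Pairwise Euclidean distances between the samples, copied from the post
d <- matrix(c(0,        3.508321, 4.352360,
              3.508321, 0,        2.425648,
              4.352360, 2.425648, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("V1", "V2", "V3"), c("V1", "V2", "V3")))

# Average linkage clustering on the precomputed distances
# (assumption: base hclust() stands in for hcluster() here)
hc <- hclust(as.dist(d), method = "average")

# Heights at which the two merges happen (the y-axis of the dendrogram)
print(hc$height)

# The first height is simply the V2-V3 distance
print(d["V2", "V3"])

# The second height is the mean distance from V1 to V2 and to V3
print(mean(c(d["V1", "V2"], d["V1", "V3"])))
```

The first height should match the V2-V3 distance and the second the mean 
of the two distances involving V1.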

This is in contrast to TreeView, which visualizes the distances as branch 
lengths. If you visualize the tree in TreeView (by Rod Page), the branch 
lengths reflect the Euclidean distances between the samples, and the 
samples are not equidistant. For example, the distance between V2 and V3 
is approx. 2.4. In the tree drawn by TreeView, the branch lengths are half 
of that, so each terminal branch leading to either V2 or V3 is about 1.2.

You also asked about the 0.752... distances in the tree:

> 'hclust_12_probes_newick' file contains:
> (V1:0.752346233726435,(V2:1.21282408894056,V3:1.21282408894056):
> 0.752346233726435);

The first is the length of the branch leading to V1, the second is the 
length of the only internal branch of the tree. Those are computed from the 
pairwise distances between samples using the average linkage (UPGMA) 
algorithm.
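
The arithmetic behind the numbers in the file can be sketched as below. 
This is only my reading of how the values arise (half of each merge height, 
with the internal branch as the difference between the two halves), not a 
statement about the hc2Newick code itself. The distances are the ones from 
your Excel table, and the variable names are made up for illustration.

```r
# Pairwise Euclidean distances from the original post
d12 <- 3.508320996  # V1-V2
d13 <- 4.352360295  # V1-V3
d23 <- 2.425648178  # V2-V3

# Merge heights under average linkage
h_low <- d23                 # V2 and V3 join here
h_top <- mean(c(d12, d13))   # V1 joins the (V2, V3) cluster here

# Terminal branches to V2 and V3: half of the lower merge height
tip_v2_v3 <- h_low / 2

# Internal branch: difference between the two half-heights
internal <- h_top / 2 - h_low / 2

print(c(tip_v2_v3 = tip_v2_v3, internal = internal), digits = 10)
```

With these inputs the two values agree with the 1.2128... and 0.7523... in 
the Newick file up to rounding of the distances.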

- Jarno


On Mon, 15 Oct 2007, Donna Toleno wrote:

> Hello list.
>
> When I make an R Cluster Dendrogram, it looks very different from the clustering in the Newick file displayed in TreeView (Rod Page's program). I tried a simple example with 12 probes and 3 samples and I did the Euclidean distances manually and with R.
>
>
>> library(ctc)
>> data
>         V1       V2       V3
> 1  4.184499 4.142575 4.017366
> 2  3.459849 3.455023 3.732115
> 3  8.287278 4.887692 5.007794
> 4  4.137224 4.523774 4.191996
> 5  4.431768 4.356945 4.570331
> 6  3.867442 3.931225 3.967566
> 7  3.480681 3.609997 3.522618
> 8  3.460785 3.966638 3.708675
> 9  4.306729 4.480724 4.399165
> 10 4.290001 4.036634 4.078688
> 11 6.707544 7.179901 9.475103
> 12 6.837264 6.845438 7.364477
>> hc <- hcluster(t(data), link = "ave")
>> write(hc2Newick(hc),file='hclust_12_probes_newick')
>> plot (hc)
>> hc
>
> Call:
> hcluster(x = t(data), link = "ave")
>
> Cluster method   : average
> Distance         : euclidean
> Number of objects: 3
>
> 'hclust_12_probes_newick' file contains:
> (V1:0.752346233726435,(V2:1.21282408894056,V3:1.21282408894056):0.752346233726435);
>
> I can see that the above Newick formatted tree shows that sample 2 and sample 3 are the appropriate distance apart, about 2.4, but where does the 0.7523... come from? How do I interpret  "Height" on the y-axis of this dendrogram? I would like a tree that represents the expression difference. The Newick tree viewed in TreeView (Rod Page's Treeview)  looks different from the dendrogram produced by hcluster, but the branch lengths still do not reflect the Euclidean distances. In my example, the Newick tree shows all three samples about equidistant from each other.  Perhaps I should be using phylogenetic tree drawing to get the appropriate branch lengths from the Euclidean distances? I also experimented with hclust2treeview but this seems to refer to Michael Eisen's Treeview. I am not familiar with this program or the file formats it uses.
>
> Thank you for reading. Any comments will be appreciated.
>
> Euclidean distance manually calculated in Excel for all of the 12 probes:
>
>          V2            V3
> V1   3.508320996   4.352360295
> V2                 2.425648178
>
>> distances.12.probes <- as.matrix(dist(t(data), method = "euclidean", diag = FALSE))
>> distances.12.probes
>         V1       V2       V3
> V1 0.000000 3.508321 4.352360
> V2 3.508321 0.000000 2.425648
> V3 4.352360 2.425648 0.000000
>
>
> Thank you again.
>
> -Donna

-----------------------------------------------------------------------------
Jarno Tuimala, PhD, bioinformatics, CSC, P.O.Box 405, FI-02101 Espoo, Finland 
tel.: +358 9 457 2226, fax: +358 9 457 2302, e-mail: jarno.tuimala at csc.fi
CSC is the Finnish IT Center for Science, http://www.csc.fi/molbio


