[BioC] Correct use of a distance measure when clustering gene expression data

Fri Sep 3 14:42:01 CEST 2004

Hi Mick

I think it depends on the kind of similarity that is important to you.

- If you think it is important that genes that show parallel profiles 
are clustered together, use Pearson's correlation coefficient. In this 
case two genes that peak at the same moment in time, but at a (very) 
different height, will be found in the same cluster.
- If, on the other hand, you think it is important that genes with a 
similar extent of regulation are clustered together, use Euclidean 
distance. This clusters together genes whose peaks occur at roughly 
the same height, but whose profiles are not necessarily parallel.
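
To make the difference concrete, here is a small Python sketch. It is only an illustration: the three profiles and the function names are made up, not taken from any real dataset, and "1 - Pearson correlation" is used as the correlation-based distance.

```python
import math

def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)

def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical log2 ratios over five time points (invented for illustration).
gene_a = [0.0, 1.0, 3.0, 1.0, 0.0]  # reference profile
gene_b = [0.0, 0.2, 0.6, 0.2, 0.0]  # parallel to gene_a, one fifth of the height
gene_c = [1.0, 2.0, 3.0, 0.0, 0.0]  # same peak height, but not parallel

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "pearson:", round(pearson_distance(gene_a, other), 3),
          "euclidean:", round(euclidean_distance(gene_a, other), 3))
```

With these made-up numbers, the correlation-based distance puts gene_b closest to gene_a, while Euclidean distance puts gene_c closest, so the two measures would group the genes differently.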

So it depends on your question. For time-course data, I'd say Pearson's 
correlation coefficient gives more relevant results. We don't really know 
how much of a gene product is necessary for a biological effect anyway, 
and moreover the amount of active protein in a cell depends on a lot 
more than just the number of mRNA molecules, which we have no way of 
looking at with a microarray. So I think the shape of the curves is 
more important than the amplitude.

Furthermore, if I were you, I would subtract the log values of the 
ref-t0 comparison from all other ref-tx comparisons in your first 
dataset, so that the values in your two datasets are comparable and 
reflect gene regulation relative to timepoint 0. It would make it 
easier to get your head around what the numbers on your screen 
actually mean.
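
As a rough sketch of that subtraction in Python (the values and the function name are invented for illustration): since the data are log2 ratios against a common reference, log2(tx/ref) - log2(t0/ref) = log2(tx/t0), so each value becomes the regulation relative to timepoint 0.

```python
def relative_to_t0(log_ratios):
    """Subtract the t0 log ratio (first element) from every later time point."""
    t0 = log_ratios[0]
    return [value - t0 for value in log_ratios[1:]]

# Hypothetical log2(sample/reference) values for one gene at t0, t1, t2, t3.
reference_design = [0.5, 1.5, 2.5, 1.0]

# The result holds log2(tx/t0) for t1, t2 and t3.
print(relative_to_t0(reference_design))  # prints [1.0, 2.0, 0.5]
```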

This is all from a biologist so consult with a mathematician as well!

Hope this is of use to you.
Floor

_______________________________________________________
Floor Stam

Vrije Universiteit Amsterdam
Faculty of Earth and Life Sciences
Department of Molecular and Cellular Neurobiology
De Boelelaan 1085
1081HV Amsterdam
The Netherlands

Ph: 	+31-20-4447114
	+31-20-5665512
Fax: 	+31-20-4447112
e-mail: fjstam at bio.vu.nl
_______________________________________________________
On 2 Sep 2004, at 17:38, michael watson (IAH-C) wrote:

> Hi
>
> I have two different data sets, both time-courses.  One uses a common
> reference for the Cy3 channel, the other performs direct comparisons
> between treated/untreated samples at each time-point.  In both cases the
> actual data is log2(Cy5/Cy3).
>
> After a bit of thought, I've come to the conclusion that as a distance
> measure for the first dataset I will use "1 - pearson correlation
> coefficient".  However, for the second dataset, as we performed direct
> comparisons at each time-point, using the correlation coefficient is not
> appropriate, so have decided to use euclidean distance.
>
> Does anyone have experience of what the best distance measure to use is
> for time-courses where direct comparisons are made at each time-point?
>
> Cheers
> Mick