[BioC] No replicates and differential analysis !!

Aedin Culhane aedin at jimmy.harvard.edu
Thu Jan 26 17:06:12 CET 2006


Hi Nicolas,
I recently had to analyse the same type of data. We had only 2 arrays 
from rare mRNA (each array contained pooled mRNA from 5 animals), and we 
wanted to compare them. All we could do was rank the genes by their 
difference and take those with the largest fold change. We found that 
the method used to compute the probeset expression values made a big 
difference to the number of genes that had a >2-fold difference. When we 
used mas5 to compute the expression values, we had over 2,700 genes with 
greater than a 2-fold change. When gcRMA was used, 260 genes had a 
2-fold difference, and with vsn only 11 genes had a 2-fold difference. I 
have lots of details on this analysis if it will help you. We found that 
most of the genes that mas5 called different were in the low expression 
range, and could not be trusted.
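
A minimal sketch of the ranking step described above, in Python. The 
gene names and intensities are made up, `rank_by_fold_change` is a 
hypothetical helper, and the input is assumed to be strictly positive 
linear-scale expression values from two single arrays.

```python
import math

def log2_fold_changes(array_a, array_b):
    """Per-gene log2 ratio (B over A) for two arrays of positive, linear-scale values."""
    return {gene: math.log2(array_b[gene] / array_a[gene]) for gene in array_a}

def rank_by_fold_change(array_a, array_b, min_fold=2.0):
    """Genes whose fold change exceeds min_fold in either direction, largest change first."""
    threshold = math.log2(min_fold)
    ranked = sorted(log2_fold_changes(array_a, array_b).items(),
                    key=lambda item: abs(item[1]), reverse=True)
    return [(gene, lfc) for gene, lfc in ranked if abs(lfc) > threshold]

# Made-up intensities, for illustration only.
array_a = {"gene1": 50.0, "gene2": 1000.0, "gene3": 200.0, "gene4": 30.0}
array_b = {"gene1": 120.0, "gene2": 1100.0, "gene3": 90.0, "gene4": 31.0}

hits = rank_by_fold_change(array_a, array_b)
for gene, lfc in hits:
    print(f"{gene}\tlog2 ratio = {lfc:+.2f}")
```

The number of hits depends entirely on how the input values were 
computed, which is why the choice of preprocessing matters so much here.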

We validated 8 genes which were >2-fold different with both vsn and 
gcRMA using RT-PCR.  We had excellent correlation in all cases. vsn does 
very slightly "under-estimate" the fold difference. I would definitely 
trust any genes that have a >2-fold difference when using vsn. I would 
not trust these if they are called using mas5. The glog transformation 
is worth applying, particularly in these kinds of analyses.  We found 
the glog-ratio to be reliable. Of course, we have no real idea of the 
number of true positives we missed (false negatives). 

By using vsn and removing the intensity-dependence of the variance, you 
can argue that you have removed the denominator of the t-statistic (it 
becomes roughly the same for every gene) and thus that comparing the 
"mean" difference is valid.  Of course the mean has an n of 1, so it is 
just the glog-ratio.  Albeit a woolly assumption, it at least gives a 
better basis for your analysis.

The second thing I might consider is checking for replicate probesets 
on the array. If the replicate probesets agree, then you can be more 
confident in the result. 
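
A minimal sketch of that check, assuming you already have a log-ratio 
per probeset and a probeset-to-gene mapping. Both are made up here (a 
real mapping would come from the array annotation), and 
`concordant_genes` is a hypothetical helper that keeps genes whose 
probesets all pass the threshold in the same direction.

```python
from collections import defaultdict

def concordant_genes(probeset_lfc, probeset_to_gene, threshold=1.0):
    """Genes with at least two probesets that all exceed the threshold in the same direction."""
    by_gene = defaultdict(list)
    for probeset, lfc in probeset_lfc.items():
        by_gene[probeset_to_gene[probeset]].append(lfc)

    agreeing = []
    for gene, values in by_gene.items():
        if len(values) < 2:
            continue  # no replicate probeset, so nothing to compare against
        all_up = all(v > threshold for v in values)
        all_down = all(v < -threshold for v in values)
        if all_up or all_down:
            agreeing.append(gene)
    return sorted(agreeing)

# Made-up probeset IDs and log2 ratios, for illustration only.
probeset_lfc = {"ps_1a": 1.4, "ps_1b": 1.2, "ps_2a": 1.6, "ps_2b": -0.2,
                "ps_3a": -1.3, "ps_3b": -1.1, "ps_4a": 2.0}
probeset_to_gene = {"ps_1a": "geneA", "ps_1b": "geneA", "ps_2a": "geneB",
                    "ps_2b": "geneB", "ps_3a": "geneC", "ps_3b": "geneC",
                    "ps_4a": "geneD"}

confident = concordant_genes(probeset_lfc, probeset_to_gene)
print(confident)
```

Genes with a single probeset are skipped rather than rejected, since 
there is simply no second measurement to agree or disagree with.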

Although fold change isn't a good statistical measure, a good variance 
estimate can be difficult to obtain.  We just completed a comparison of 
feature selection methods (Jeffery et al.) in which we showed that at 
low numbers of replicates (n<5), rank products or even fold change can 
perform as well as or outperform t-statistic and moderated t-statistic 
methods, depending on the variance structure of the data. 
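
For readers unfamiliar with rank products, a minimal sketch of the 
statistic: each gene's fold change is ranked within every replicate 
comparison, and the ranks are combined as a geometric mean. The data are 
made up and `rank_product` is a hypothetical helper. It shows the 
statistic only, not the permutation-based significance estimate a full 
analysis would add, and it is not the code used in the comparison 
mentioned above.

```python
import math

def rank_product(replicates):
    """Geometric mean of each gene's rank across replicates (rank 1 = largest fold change).

    `replicates` is a list of dicts mapping gene -> log fold change,
    all with the same set of genes.
    """
    genes = list(replicates[0])
    log_rank_sum = {gene: 0.0 for gene in genes}
    for replicate in replicates:
        ordered = sorted(genes, key=lambda g: replicate[g], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_rank_sum[gene] += math.log(rank)
    k = len(replicates)
    return {gene: math.exp(total / k) for gene, total in log_rank_sum.items()}

# Made-up log fold changes from two replicate comparisons.
replicates = [
    {"gene1": 2.0, "gene2": 0.1, "gene3": 1.0},
    {"gene1": 1.5, "gene2": 0.3, "gene3": 1.8},
]

rp = rank_product(replicates)
for gene in sorted(rp, key=rp.get):
    print(f"{gene}\trank product = {rp[gene]:.2f}")
```

A small rank product means the gene was consistently near the top of the 
list; no variance estimate is needed, which is why the method remains 
usable with very few replicates.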

Hope this helps,
Regards
Aedin
--------------------
www.hsph.harvard.edu/researchers/aculhane.html


Date: Wed, 25 Jan 2006 16:43:51 +0000
From: Wolfgang Huber <huber at ebi.ac.uk>
Subject: Re: [BioC] No replicates and differential analysis !!
To: Nicolas Servant <Nicolas.Servant at curie.fr>
Cc: Bioconductor <bioconductor at stat.math.ethz.ch>
Message-ID: <43D7AAC7.9080401 at ebi.ac.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi Nicolas,

 > And it is
 > supported that the FC tends to be greater at low expression levels.

What is supported is that the variance of the _estimate_ of the FC (the
true underlying quantity) by the log-ratio of measured probe intensities
tends to be greater at low expression levels. Indeed this depends on the
preprocessing and background correction. Consider this paper:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=12169536

and the accompanying "vsn" package in bioC. It removes the
intensity-dependence of the variance, and you can use the "glog-ratio",
which is an alternative estimator of FC, to select genes. This amounts
to assuming that all genes have the same variance.
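
To illustrate how a glog-ratio behaves, here is a minimal sketch in 
Python. It assumes an arsinh-type generalised log with a single 
hand-picked constant `c`; vsn itself estimates its calibration 
parameters from the data, so this is not the vsn fit, and the 
intensities are made up.

```python
import math

def glog(x, c):
    """Generalised log via arsinh: close to log(2x/c) for x >> c, close to x/c near zero."""
    return math.asinh(x / c)

def glog_ratio_log2(x1, x2, c):
    """Difference of glog values, rescaled from natural-log to log2 units."""
    return (glog(x2, c) - glog(x1, c)) / math.log(2)

c = 100.0  # hypothetical constant, chosen by hand for the illustration

low = (20.0, 50.0)       # 2.5-fold change at low intensity
high = (2000.0, 5000.0)  # the same 2.5-fold change at high intensity

low_glog = glog_ratio_log2(low[0], low[1], c)
high_glog = glog_ratio_log2(high[0], high[1], c)
plain = math.log2(2.5)   # ordinary log2 ratio, identical for both pairs

print(f"log2 ratio (both pairs): {plain:.2f}")
print(f"glog-ratio, low pair:    {low_glog:.2f}")
print(f"glog-ratio, high pair:   {high_glog:.2f}")
```

At high intensity the glog-ratio is close to the ordinary log-ratio, 
while at low intensity it is shrunk towards zero, so a fixed threshold 
no longer favours weakly expressed genes.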

Of course the assumption is not really true; there can be gene-specific
causes for different variances (besides overall intensity). But with
only two arrays you have no way of seeing them. Hence, using glog-ratio
to select genes when there are no replicates is an extreme version of
the moderated t-statistic (which is often used when there are few
replicates).
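
The link to the moderated t-statistic can be sketched as follows, 
assuming the usual shrinkage form in which a gene's variance is a 
weighted average of its own sample variance and a shared prior variance, 
weighted by their degrees of freedom. All names and numbers are 
hypothetical. Without replicates a gene has zero residual degrees of 
freedom, so the denominator is the same for every gene and the ranking 
reduces to the ranking by glog-ratio.

```python
import math

def shrunken_variance(s2_gene, df_gene, s2_prior, df_prior):
    """Weighted average of the gene's own variance and a shared prior variance."""
    return (df_prior * s2_prior + df_gene * s2_gene) / (df_prior + df_gene)

def moderated_t(diff, s2_gene, df_gene, s2_prior, df_prior):
    """Difference over the shrunken standard deviation (constant design factor omitted)."""
    return diff / math.sqrt(shrunken_variance(s2_gene, df_gene, s2_prior, df_prior))

# Made-up glog-ratios from a two-array comparison.
glog_ratios = {"gene1": 1.5, "gene2": -0.3, "gene3": 0.8}
s2_prior, df_prior = 0.25, 4.0  # hypothetical shared variance and its weight

# No replicates: df_gene = 0, so the gene-specific variance carries no weight.
t_stats = {gene: moderated_t(diff, s2_gene=0.0, df_gene=0.0,
                             s2_prior=s2_prior, df_prior=df_prior)
           for gene, diff in glog_ratios.items()}

by_t = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)
by_ratio = sorted(glog_ratios, key=lambda g: abs(glog_ratios[g]), reverse=True)
print(by_t == by_ratio)
```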

Best wishes
Wolfgang




Nicolas Servant wrote:

>> Thanks for your answer,
>> But in this case, I have to choose a fold change threshold! And it is 
>> supported that the FC tends to be greater at low expression levels.
>> For instance, a FC greater than 2 for expression values near 50 is 
>> readily seen, but a FC greater than 2 is unlikely to be observed for 
>> expression values near 1000.
>> So I would like to use a more robust approach.
>> 
>> Regards,
>> Nicolas S.
>


