[BioC] No replicates and differential analysis !!
Aedin Culhane
aedin at jimmy.harvard.edu
Thu Jan 26 17:06:12 CET 2006
Hi Nicolas,
I recently had to analyse the same type of data. We had only 2 arrays
from rare mRNA (each array contained pooled mRNA from 5 animals), and
we wanted to compare them. All we could do was rank the genes by their
difference and take those with the largest fold change. We found that
the method used to compute the probeset expression values made a big
difference to the number of genes that had a >2 fold difference. When we
applied mas5 to compute the expression values, we had over 2,700 genes
with greater than a 2 fold change. When gcRMA was used, 260 genes had a
2 fold difference, and with vsn only 11 genes had a 2 fold difference. I
have lots of details on this analysis if it will help you. We found most
of the genes that mas5 called different were in the low expression
range, and could not be trusted.
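
A minimal sketch of the counting step described above, in Python rather
than R. The function name and all expression values are hypothetical,
chosen only to show how the same two arrays can yield different counts
depending on how the probeset values were summarised; they are not our
data.

```python
import math

def count_fold_changes(array_a, array_b, fold=2.0):
    """Count genes whose ratio between two arrays exceeds `fold`
    in either direction (up or down)."""
    cutoff = math.log2(fold)
    count = 0
    for a, b in zip(array_a, array_b):
        if a <= 0 or b <= 0:
            continue  # log-ratio undefined for non-positive values
        if abs(math.log2(a / b)) > cutoff:
            count += 1
    return count

# Hypothetical values for four genes on two arrays, summarised two ways.
# The two low-intensity genes differ a lot under the first method only.
method_one_a = [12.0, 40.0, 900.0, 2100.0]
method_one_b = [30.0, 15.0, 950.0, 1000.0]
method_two_a = [55.0, 70.0, 900.0, 2100.0]
method_two_b = [60.0, 62.0, 950.0, 1000.0]

n_method_one = count_fold_changes(method_one_a, method_one_b)
n_method_two = count_fold_changes(method_two_a, method_two_b)
```

In this toy example the extra calls under the first method all come from
the low-intensity genes, which is the pattern we saw with mas5.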
We validated 8 genes which were >2 fold different on both vsn and gcRMA
using RT-PCR. We had excellent correlation in all cases. vsn does very
slightly "under-estimate" the fold difference. I would definitely trust
any genes that have a >2 fold difference when using vsn. I would not
trust these if they are called using mas5. The glog transformation is
worth applying particularly in these kinds of analyses. We found the
glog-ratio to be reliable. Of course we have no real idea of the number
of true positives we missed (false -ve).
By using vsn, and removing the intensity-dependence of the variance, you
can argue that you have removed the denominator of the t-statistic and
thus comparing the "mean" difference is valid. Of course the mean has
an n of 1, so it is just the glog-ratio. Albeit a woolly assumption, at
least it gives a better basis to your analysis.
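
To make the glog-ratio concrete, here is a rough Python sketch. It
assumes the arsinh form of the transformation, h(x) = arsinh(a + b*x);
in vsn the offset and scale are estimated from the data for each array,
whereas the values used here are hypothetical and fixed by hand.

```python
import math

def glog(intensity, offset=0.0, scale=1.0):
    """Generalised log of an intensity, using the arsinh form.
    `offset` and `scale` are hypothetical stand-ins for the
    per-array calibration parameters."""
    return math.asinh(offset + scale * intensity)

def glog_ratio(intensity_a, intensity_b, offset=0.0, scale=1.0):
    """Difference of glog-transformed intensities (natural-log scale)."""
    return glog(intensity_a, offset, scale) - glog(intensity_b, offset, scale)

# A 2-fold difference at high intensity is close to log(2) ...
high = glog_ratio(2000.0, 1000.0, scale=0.01)
# ... but the same 2-fold difference near the noise floor is shrunk.
low = glog_ratio(2.0, 1.0, scale=0.01)
```

At high intensities the glog-ratio behaves like an ordinary log-ratio,
while at low intensities it is pulled towards zero, which is consistent
with the slight under-estimation mentioned above.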
The second thing I might consider is checking for replicate probesets
on the array. If the replicate probesets agree, then you can be more
confident in the result.
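
One way this check could be sketched in Python is shown below. The
probeset and gene identifiers, the mapping, and the cutoff are all
hypothetical; a real analysis would take the mapping from the array
annotation.

```python
from collections import defaultdict

def genes_with_agreeing_probesets(probeset_to_gene, probeset_ratio, cutoff):
    """Return genes that have at least two probesets, all of which
    exceed `cutoff` in the same direction (all up or all down)."""
    by_gene = defaultdict(list)
    for probeset, gene in probeset_to_gene.items():
        if probeset in probeset_ratio:
            by_gene[gene].append(probeset_ratio[probeset])

    agreeing = []
    for gene, ratios in by_gene.items():
        if len(ratios) < 2:
            continue  # no replicate probeset to compare against
        all_up = all(r > cutoff for r in ratios)
        all_down = all(r < -cutoff for r in ratios)
        if all_up or all_down:
            agreeing.append(gene)
    return sorted(agreeing)

# Hypothetical annotation and glog-ratios (natural-log scale).
probeset_to_gene = {
    "ps_1": "geneA", "ps_2": "geneA",
    "ps_3": "geneB", "ps_4": "geneB",
    "ps_5": "geneC",
}
probeset_ratio = {
    "ps_1": 0.9, "ps_2": 1.1,    # both up: agree
    "ps_3": 0.8, "ps_4": -0.1,   # disagree
    "ps_5": 1.5,                 # single probeset, cannot be checked
}
confirmed = genes_with_agreeing_probesets(probeset_to_gene, probeset_ratio, 0.69)
```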
Although fold change isn't a good statistical measure, a good variance
estimate can be difficult to obtain. We just completed a comparison of
feature selection methods (Jeffery et al.), in which we showed that with
a low number of replicates (n<5), rank products or even fold change can
perform as well as or outperform t-statistic and moderated t-statistic
methods, depending on the variance structure of the data.
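
For readers unfamiliar with rank products, a bare-bones Python sketch of
the statistic follows. It assumes the usual definition (the geometric
mean, across replicate comparisons, of each gene's rank when genes are
sorted by fold change), ignores ties, and omits the permutation step
used to assess significance. It is not the code used in our comparison,
and the data are hypothetical.

```python
import math

def rank_products(log_ratios_per_replicate):
    """Geometric mean of each gene's rank across replicate comparisons.
    Rank 1 is the most up-regulated gene in a replicate.
    Input: one list of log-ratios per replicate, same gene order in each."""
    n_replicates = len(log_ratios_per_replicate)
    n_genes = len(log_ratios_per_replicate[0])
    log_rank_sum = [0.0] * n_genes
    for ratios in log_ratios_per_replicate:
        order = sorted(range(n_genes), key=lambda g: ratios[g], reverse=True)
        for rank, gene in enumerate(order, start=1):
            log_rank_sum[gene] += math.log(rank)
    return [math.exp(total / n_replicates) for total in log_rank_sum]

# Hypothetical log-ratios for three genes in two replicate comparisons.
rp = rank_products([[2.0, 0.1, -1.0],
                    [0.2, 1.5, -0.5]])
```

With a single comparison the rank product reduces to the plain ranking
by fold change.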
Hope this helps,
Regards
Aedin
--------------------
www.hsph.harvard.edu/researchers/aculhane.html
Date: Wed, 25 Jan 2006 16:43:51 +0000
From: Wolfgang Huber <huber at ebi.ac.uk>
Subject: Re: [BioC] No replicates and differential analysis !!
To: Nicolas Servant <Nicolas.Servant at curie.fr>
Cc: Bioconductor <bioconductor at stat.math.ethz.ch>
Message-ID: <43D7AAC7.9080401 at ebi.ac.uk>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hi Nicolas,
> And it is
> supported that the FC tends to be greater at low expression levels.
What is supported is that the variance of the _estimate_ of the FC (the
true underlying quantity) by the log-ratio of measured probe intensities
tends to be greater at low expression levels. Indeed this depends on the
preprocessing and background correction. Consider this paper:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=12169536
and the accompanying "vsn" package in bioC. It removes the
intensity-dependence of the variance, and you can use the "glog-ratio",
which is an alternative estimator of FC, to select genes. This amounts
to assuming that all genes have the same variance.
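
A rough numerical illustration of the first point, in Python. It assumes
an additive-multiplicative error model (measured = true * exp(eta) + eps)
and a first-order delta-method approximation, which is crude when the
signal is close to the additive noise. The noise levels are hypothetical
and the function names are made up for this sketch.

```python
import math

def approx_log_variance(mean_signal, sd_additive, sd_multiplicative):
    """Delta-method approximation to Var(log Y) under an assumed
    additive-multiplicative error model:
    Var(log Y) ~ sd_multiplicative^2 + (sd_additive / mean_signal)^2."""
    return sd_multiplicative ** 2 + (sd_additive / mean_signal) ** 2

def approx_log_ratio_sd(mean_signal, sd_additive, sd_multiplicative):
    """Approximate SD of the log-ratio of two independent measurements
    of a gene with the same true level on both arrays."""
    return math.sqrt(2.0 * approx_log_variance(mean_signal, sd_additive,
                                               sd_multiplicative))

# Hypothetical noise levels; expression levels taken from the question.
sd_near_50 = approx_log_ratio_sd(50.0, sd_additive=20.0, sd_multiplicative=0.1)
sd_near_1000 = approx_log_ratio_sd(1000.0, sd_additive=20.0, sd_multiplicative=0.1)
```

Under these assumptions the log-ratio is several times noisier near 50
than near 1000, so large apparent fold changes at low intensity are
expected even when nothing has changed.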
Of course the assumption is not really true; there can be gene-specific
causes for different variances (besides overall intensity). But with
only two arrays you have no way of seeing them. Hence, using glog-ratio
to select genes when there are no replicates is an extreme version of
the moderated t-statistic (which is often used when there are few
replicates).
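
The "extreme version" remark can be sketched with the usual form of the
moderated t-statistic. The notation below is an assumption and does not
come from this thread: s_g^2 is the gene-wise variance on d_g residual
degrees of freedom, s_0^2 and d_0 are the prior variance and prior
degrees of freedom, beta_g is the estimated (g)log-ratio, and v_g is a
design-dependent scaling factor.

```latex
\[
  \tilde{s}_g^{2} = \frac{d_0\, s_0^{2} + d_g\, s_g^{2}}{d_0 + d_g},
  \qquad
  \tilde{t}_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}} .
\]
% With no replicates there are no residual degrees of freedom, d_g = 0:
\[
  d_g = 0
  \;\Longrightarrow\;
  \tilde{s}_g^{2} = s_0^{2}
  \;\Longrightarrow\;
  \tilde{t}_g = \frac{\hat{\beta}_g}{s_0 \sqrt{v_g}} \;\propto\; \hat{\beta}_g .
\]
```

When v_g is the same for every gene, ranking by this statistic is the
same as ranking by the glog-ratio itself. This is a conceptual limit
only: with d_g = 0 the prior parameters cannot be estimated from the
data, so the common variance is an assumption rather than an estimate.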
Best wishes
Wolfgang
Nicolas Servant wrote:
>> Thanks for your answer,
>> But in this case, i have to choose a fold change threshold ! And it is
>> supported that the FC tends to be greater at low expression levels.
>> For instance a FC greater than 2 for expression values near 50 is
>> readily seen, but it is low probability to observe FC greater than 2 for
>> expression values near 1000
>> So i would like to use a more robust approach.
>>
>> Regards,
>> Nicolas S.
>