[BioC] No replicates and differential analysis !!

Thu Jan 26 17:42:26 CET 2006

Again, we need to be careful about what is "validated" by PCR.

If the RNA used for PCR came from the same samples that were
hybridized to the arrays, you have validated that the arrays "worked"
technically.  (And this is certainly worth knowing.)

But what we usually want to validate is that the genes are 
differentially expressed in the population, which can only be 
validated by use of an independent sample.

--Naomi

At 11:06 AM 1/26/2006, Aedin Culhane wrote:
>Hi Nicolas,
>I recently had to analyse the same type of data. We had only 2 arrays
>from rare mRNA (each array contained pooled mRNA from 5 animals), and
>we wanted to compare them. All we could do was rank the genes by their
>difference and take those with the largest fold change. We found that
>the method used to compute the probeset expression values made a big
>difference to the number of genes that had a >2 fold difference. When we
>applied mas5 to compute the expression values, we had over 2,700 genes with
>greater than a 2 fold change. When gcRMA was used, 260 genes had a 2
>fold difference, and with vsn only 11 genes had a 2 fold difference. I
>have lots of details on this analysis if it will help you. We found most
>of the genes that mas5 called different were in the low expression
>range, and could not be trusted.
>
>We validated 8 genes which were >2 fold different with both vsn and gcRMA
>using RT-PCR.  We had excellent correlation in all cases. vsn does very
>slightly "under-estimate" the fold difference. I would definitely trust
>any genes that have a >2 fold difference when using vsn. I would not
>trust these if they are called using mas5. The glog transformation is
>worth applying particularly in these kinds of analyses.  We found the
>glog-ratio to be reliable. Of course we have no real idea of the number
>of true positives we missed (false -ve).
>
>By using vsn and removing the intensity-dependence of the variance, you
>can argue that you have removed the denominator of the t-statistic and
>thus comparing the "mean" difference is valid.  Of course the mean has
>an n of 1, so it's just the glog-ratio.  Albeit a woolly assumption, at
>least it gives a better basis to your analysis.
>
>The second thing I might consider is checking for replicate probesets
>on the array. If the replicate probesets agree, then you can be more
>confident in the result.
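
A minimal sketch of such a replicate-probeset check, in plain Python. The probeset-to-gene mapping, the IDs, and the threshold are all hypothetical; in practice the mapping would come from the array annotation.

```python
from collections import defaultdict

def replicate_agreement(probeset_to_gene, ratios, threshold):
    """For genes with two or more probesets, report whether all of their
    probesets pass the threshold in the same direction.

    ratios: dict mapping probeset ID -> log-scale ratio (e.g. glog-ratio).
    Returns a dict mapping gene -> True/False; genes with a single
    probeset are left out because there is nothing to compare.
    """
    by_gene = defaultdict(list)
    for probeset, gene in probeset_to_gene.items():
        if probeset in ratios:
            by_gene[gene].append(ratios[probeset])

    agreement = {}
    for gene, values in by_gene.items():
        if len(values) < 2:
            continue
        all_up = all(v >= threshold for v in values)
        all_down = all(v <= -threshold for v in values)
        agreement[gene] = all_up or all_down
    return agreement

# Made-up annotation and ratios.
probeset_to_gene = {
    "ps_101": "geneA", "ps_102": "geneA",
    "ps_201": "geneB", "ps_202": "geneB",
    "ps_301": "geneC",
}
glog_ratios = {
    "ps_101": 1.4, "ps_102": 1.1,
    "ps_201": 1.3, "ps_202": -0.2,
    "ps_301": 2.0,
}

agreement = replicate_agreement(probeset_to_gene, glog_ratios, threshold=1.0)
print(agreement)
```

Agreement between probesets on the same array says something about technical consistency only; it does not replace biological replication.
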
>
>Although fold change isn't a good statistical measure, a good variance
>estimate can be difficult to obtain.  We just completed a comparison of
>feature selection methods (Jeffery et al.) in which we showed that at low
>numbers of replicates (n<5), rank products or even fold change can perform as
>well as or outperform t-statistic and moderated t-statistic methods,
>dependent on the variance structure of the data.
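
For readers unfamiliar with the rank product idea mentioned above, here is a minimal sketch in plain Python. It computes only the rank product statistic for up-regulation (the geometric mean of a gene's ranks across replicate comparisons); it ignores ties and leaves out the permutation step normally used to assess significance. It is not the code used in the comparison referred to, and the gene names and values are made up.

```python
import math

def rank_products(log_ratios_per_replicate):
    """Geometric mean of each gene's rank across replicate comparisons.

    log_ratios_per_replicate: list of dicts, one per replicate, each
    mapping gene -> log ratio. Rank 1 is the most up-regulated gene in
    that replicate, so a small rank product means consistently high rank.
    """
    genes = list(log_ratios_per_replicate[0])
    k = len(log_ratios_per_replicate)
    log_rank_sum = {g: 0.0 for g in genes}
    for rep in log_ratios_per_replicate:
        ordered = sorted(genes, key=lambda g: rep[g], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            # Summing logs avoids overflow when there are many replicates.
            log_rank_sum[g] += math.log(rank)
    return {g: math.exp(log_rank_sum[g] / k) for g in genes}

# Made-up log ratios for four genes in three replicate comparisons.
replicates = [
    {"g1": 2.1, "g2": 1.0, "g3": 0.2, "g4": -0.5},
    {"g1": 1.8, "g2": 0.1, "g3": 0.9, "g4": -1.0},
    {"g1": 2.5, "g2": 1.2, "g3": 0.3, "g4": 0.0},
]

rp = rank_products(replicates)
top_gene = min(rp, key=rp.get)
print(top_gene, rp)
```

With a single comparison (one array per condition) the rank product reduces to the rank of the fold change itself.
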
>
>Hope this helps,
>Regards
>Aedin
>--------------------
>www.hsph.harvard.edu/researchers/aculhane.html
>
>
>Date: Wed, 25 Jan 2006 16:43:51 +0000
>From: Wolfgang Huber <huber at ebi.ac.uk>
>Subject: Re: [BioC] No replicates and differential analysis !!
>To: Nicolas Servant <Nicolas.Servant at curie.fr>
>Cc: Bioconductor <bioconductor at stat.math.ethz.ch>
>Message-ID: <43D7AAC7.9080401 at ebi.ac.uk>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Hi Nicolas,
>
>  > And it is
>  > supported that the FC tends to be greater at low expression levels.
>
>What is supported is that the variance of the _estimate_ of the FC (the
>true underlying quantity) by the log-ratio of measured probe intensities
>tends to be greater at low expression levels. Indeed this depends on the
>preprocessing and background correction. Consider this paper:
>
>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=12169536
>
>and the accompanying "vsn" package in bioC. It removes the
>intensity-dependence of the variance, and you can use the "glog-ratio",
>which is an alternative estimator of FC, to select genes. This amounts
>to assuming that all genes have the same variance.
>
>Of course the assumption is not really true, there can be gene-specific
>causes for different variances (besides overall intensity). But with
>only two arrays you have no way of seeing them. Hence, using glog-ratio
>to select genes when there are no replicates is an extreme version of
>the moderated t-statistic (which is often used when there are few
>replicates).
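
To illustrate how the glog-ratio behaves compared with the ordinary log-ratio, here is a small sketch in plain Python. It uses one common form of the generalized log, glog2(x) = log2((x + sqrt(x^2 + c^2)) / 2), with a fixed constant `c` chosen arbitrarily for the example. This is an assumption made for illustration: the vsn package estimates its calibration and transformation parameters from the data, and this sketch does not reproduce that fit.

```python
import math

def glog2(x, c):
    """Generalized log (base 2); close to log2(x) when x is much larger than c."""
    return math.log2((x + math.sqrt(x * x + c * c)) / 2.0)

def log_ratio(a, b):
    """Ordinary log2 ratio of two intensities."""
    return math.log2(b / a)

def glog_ratio(a, b, c):
    """Difference of generalized logs, an alternative estimator of fold change."""
    return glog2(b, c) - glog2(a, c)

C = 100.0  # hypothetical constant, fixed by hand for this example

# The same 2.5-fold difference at low and at high intensity (made-up values).
low = (20.0, 50.0)
high = (2000.0, 5000.0)

for name, (a, b) in (("low", low), ("high", high)):
    print(name, round(log_ratio(a, b), 3), round(glog_ratio(a, b, C), 3))
```

At high intensities the two estimators nearly coincide, while at low intensities, where the measurements are dominated by noise, the glog-ratio is pulled toward zero. This is why a single threshold on the glog-ratio selects fewer low-intensity genes than the same threshold on the log-ratio.
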
>
>Best wishes
>Wolfgang
>
>
>
>
>Nicolas Servant wrote:
>
> >> Thanks for your answer,
> >> But in this case, I have to choose a fold change threshold! And it is
> >> supported that the FC tends to be greater at low expression levels.
> >> For instance, a FC greater than 2 for expression values near 50 is
> >> readily seen, but there is a low probability of observing a FC greater
> >> than 2 for expression values near 1000.
> >> So I would like to use a more robust approach.
> >>
> >> Regards,
> >> Nicolas S.
> >
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111