[BioC] Understanding limma, fdr and topTable

Tue Jul 8 19:17:13 CEST 2008

Hi all,

My views on filtering based on variance run more towards Aaron's. We had 
another conversation about this recently on the list 
(https://stat.ethz.ch/pipermail/bioconductor/2008-June/022941.html). 
Robert Gentleman asked me if I had any evidence to support my 
suspicions; I've seen some cases where even throwing out "Absent" probe 
sets affects the eBayes calculation of the p-values so that they are 
larger than when all probe sets are used. I tried a couple of years 
ago to really investigate this, but I couldn't find a good way to 
adequately generate microarray data with known numbers of DE genes.

Does anybody know of a good microarray data simulator that gives data 
that looks like real data? If so, I could do some playing around with 
simulations and present a poster at BioC that we can hack apart :)
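
As a starting point for that kind of playing around, here is a minimal
sketch in Python (standard library only) of a toy two-group simulator with
a known number of DE genes, plus a filter on overall variance. Everything
in it is an assumption made for illustration: the function names, the
Gaussian noise, the uniform ranges for baseline and SD, and the fixed
log fold change are all hypothetical. It does not produce data that looks
like real microarray data, which is the hard part being asked about, and
it does not run eBayes.

```python
import random
import statistics


def simulate_two_group(n_genes=1000, n_de=100, n_per_group=4,
                       log_fc=1.0, seed=42):
    """Simulate log2 expression for two groups with a known set of DE genes.

    Assumptions (all hypothetical): Gaussian noise, a gene-specific SD drawn
    uniformly, and the first n_de genes shifted by log_fc in group B.
    """
    rng = random.Random(seed)
    genes = []
    for g in range(n_genes):
        is_de = g < n_de
        baseline = rng.uniform(4.0, 12.0)   # gene-specific mean level
        sd = rng.uniform(0.1, 0.6)          # gene-specific noise
        shift = log_fc if is_de else 0.0
        group_a = [rng.gauss(baseline, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(baseline + shift, sd) for _ in range(n_per_group)]
        genes.append({"id": g, "is_de": is_de, "a": group_a, "b": group_b})
    return genes


def filter_by_variance(genes, keep_fraction=0.5):
    """Keep the top keep_fraction of genes by overall variance.

    The variance is computed across all samples, ignoring group labels.
    """
    ranked = sorted(genes,
                    key=lambda x: statistics.variance(x["a"] + x["b"]),
                    reverse=True)
    n_keep = int(len(genes) * keep_fraction)
    return ranked[:n_keep]


genes = simulate_two_group()
kept = filter_by_variance(genes)
n_de_total = sum(x["is_de"] for x in genes)
n_de_kept = sum(x["is_de"] for x in kept)
print(f"{len(kept)} of {len(genes)} genes kept; "
      f"{n_de_kept} of {n_de_total} true DE genes retained")
```

Because the truth labels are known, the same simulated set could be
analyzed with and without the filter and the resulting p-values compared
against the is_de flags.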

Cheers,
Jenny

At 10:17 AM 7/8/2008, aaron.j.mackey at gsk.com wrote:
> > I would add that removing those genes that are unchanged in any sample
> > will also help reduce the multiplicity problem. Regardless of the
> > expression level, those genes that never change expression are
> > uninteresting by default, so e.g., if beta-actin is highly expressed at
> > the same level in all samples we don't really care to test for
> > differential expression for that gene since it apparently is not
> > differentially expressed.
>
>This doesn't make sense.  How can I choose to filter out "unchanged"
>probesets without fitting a model of some sort, and making a probabilistic
>decision for each probeset about whether it is "unchanged" or not?  Every
>probeset (save those below the detection limit) will exhibit variance
>(though the variance may be below the precision of the instrument to
>measure), right?  You're not suggesting that there are some probesets with
>zero variance?
>
>It seems to me that this approach leads to a false/erroneous reduction in
>the multiplicity problem, as you've just moved the hypothesis testing into
>a separate "phase" of the analysis.  And it also would mess up pooled
>variance estimates such as those used in eBayes-based methods (e.g.
>limma).
>
>So, while I might be willing to filter out known "dead" probesets (that I
>never see above detection threshold over many hundreds of assays), I'm in
>the camp that the statistics are corrupt if you filter without regard to
>its effect on multiplicity corrections.
>
>As an aside, it should be possible to fit some of the models using
>truncated/censored distributions (wherein the statistical model gets to
>know that there were X number of probesets with values < threshold, but
>doesn't pretend that those values are real).  That's an idea for the model
>developers to ponder ...
>
>-Aaron

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at illinois.edu