[BioC] Understanding limma, fdr and topTable

Tue Jul 8 17:17:02 CEST 2008

> I would add that removing those genes that are unchanged in any sample 
> will also help reduce the multiplicity problem. Regardless of the 
> expression level, those genes that never change expression are 
> uninteresting by default, so e.g., if beta-actin is highly expressed at 
> the same level in all samples we don't really care to test for 
> differential expression for that gene since it apparently is not 
> differentially expressed.

This doesn't make sense.  How can I choose to filter out "unchanged" 
probesets without fitting a model of some sort and making a probabilistic 
decision for each probeset about whether it is "unchanged" or not?  Every 
probeset (save those below the detection limit) will exhibit some variance 
(though that variance may be too small for the instrument to resolve), 
right?  You're not suggesting that there are some probesets with zero 
variance?

It seems to me that this approach leads to a false/erroneous reduction in 
the multiplicity problem, as you've just moved the hypothesis testing into 
a separate "phase" of the analysis.  And it also would mess up pooled 
variance estimates such as those used in eBayes-based methods (e.g. 
limma).
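
To make the multiplicity point concrete, here is a minimal sketch in plain 
Python (not limma's code) of how Benjamini-Hochberg adjusted p-values depend 
on the number of tests.  The raw p-values are made-up numbers, `bh_adjust` 
is a hypothetical helper written for this illustration, and the "filter" 
simply drops the five largest p-values, standing in for a pre-filter whose 
own selection step is never accounted for.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for 10 probesets (illustrative numbers only).
raw_p = [0.001, 0.004, 0.012, 0.03, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95]

# Correct over all 10 tests.
adj_all = bh_adjust(raw_p)

# Pretend a pre-filter declared the last 5 probesets "unchanged",
# then correct over the remaining 5 only.
adj_filtered = bh_adjust(raw_p[:5])

for p, a, f in zip(raw_p[:5], adj_all[:5], adj_filtered):
    print(f"raw={p:.3f}  adjusted(m=10)={a:.4f}  adjusted(m=5)={f:.4f}")
```

In this toy case the fourth probeset has an adjusted value above 0.05 when 
all ten tests are counted and below 0.05 after the filter, even though its 
raw p-value is the same in both runs.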

So, while I might be willing to filter out known "dead" probesets (that I 
never see above detection threshold over many hundreds of assays), I'm in 
the camp that the statistics are corrupt if you filter without regard to 
its effect on multiplicity corrections.

As an aside, it should be possible to fit some of the models using 
truncated/censored distributions (wherein the statistical model gets to 
know that there were X number of probesets with values < threshold, but 
doesn't pretend that those values are real).  That's an idea for the model 
developers to ponder ...
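
A rough sketch of the censoring idea, in plain Python and under loud 
assumptions: a single probeset, normally distributed log intensities, a 
known standard deviation, and a fixed detection threshold.  All numbers and 
function names are hypothetical, and this is not something limma provides.  
Values above the threshold contribute their density to the likelihood; the 
censored ones contribute only the probability of falling below the 
threshold.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_logcdf(x, mu, sigma):
    """Log of P(X <= x) for X ~ Normal(mu, sigma)."""
    z = (x - mu) / sigma
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def censored_loglik(observed, n_censored, threshold, mu, sigma):
    """Log-likelihood with left-censoring at `threshold`.

    Observed values enter through the density; each censored value
    enters only as "somewhere below the threshold".
    """
    ll = sum(normal_logpdf(x, mu, sigma) for x in observed)
    ll += n_censored * normal_logcdf(threshold, mu, sigma)
    return ll


# Hypothetical log2 intensities for one probeset (illustrative only).
observed = [6.2, 6.5, 7.1, 7.4, 8.0]   # values above the threshold
n_censored = 3                          # values reported below the threshold
threshold = 6.0
sigma = 1.0                             # assumed known, for simplicity

# Naive estimate: ignore the censored values entirely.
naive_mu = sum(observed) / len(observed)

# Crude grid search for the mean that maximises the censored likelihood.
grid = [5.0 + 0.01 * k for k in range(301)]
best_mu = max(
    grid,
    key=lambda mu: censored_loglik(observed, n_censored, threshold, mu, sigma),
)

print(f"naive mean: {naive_mu:.2f}, censored-likelihood mean: {best_mu:.2f}")
```

The censored fit pulls the estimated mean below the naive average of the 
observed values, because the model knows that three measurements fell under 
the threshold without treating their reported values as real.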

-Aaron