[BioC] Understanding limma, fdr and topTable

James MacDonald jmacdon at med.umich.edu
Wed Jul 9 08:17:27 CEST 2008



aaron.j.mackey at gsk.com wrote:
>> I would add that removing those genes that are unchanged in any sample 
>> will also help reduce the multiplicity problem. Regardless of the 
>> expression level, those genes that never change expression are 
>> uninteresting by default, so e.g., if beta-actin is highly expressed at 
>> the same level in all samples we don't really care to test for 
>> differential expression for that gene since it apparently is not 
>> differentially expressed.
> 
> This doesn't make sense.  How can I choose to filter out "unchanged" 
> probesets without fitting a model of some sort, and making a probabilistic 
> decision for each probeset about whether it is "unchanged" or not?  Every 
> probeset (save those below the detection limit) will exhibit variance 
> (though the variance may be below the precision of the instrument to 
> measure), right?  You're not suggesting that there are some probesets with 
> zero variance?

I don't really understand your point here. First, I never suggested 
fitting a model of any kind to select unchanged probesets, unless 
computing the variance is some kind of newfangled model fitting that I 
don't understand.

In addition, are you really claiming that a probeset that is 'below the 
detection limit' (whatever that means) will _not_ have any variance? I 
would say that doesn't make any sense. All expression values will 
exhibit some level of variance regardless of whether you might think 
they are 'below the detection limit'.

> 
> It seems to me that this approach leads to a false/erroneous reduction in 
> the multiplicity problem, as you've just moved the hypothesis testing into 
> a separate "phase" of the analysis.  And it also would mess up pooled 
> variance estimates such as those used in eBayes-based methods (e.g. 
> limma).

So yes, if I had actually advocated fitting a model you would be 
correct. However, simply deciding to exclude probesets that have a low 
variance will not affect the hypothesis testing. It could, however, 
affect the computation of the pooled variance estimates: if you remove 
too many low-variance probesets, the pooled variance might increase.
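
A minimal sketch of the kind of variance filter meant here, in Python 
only so that it is self-contained. The probeset names, the log2 values 
and the cutoff are all made up for illustration, and the plain average 
of per-probeset variances is just a crude stand-in for a pooled 
estimate; it is not what limma's eBayes computes.

```python
from statistics import mean, variance


def filter_by_variance(expr, cutoff):
    """Keep probesets whose variance across all arrays exceeds `cutoff`.

    Group labels are never consulted, so no model is fit and no test is run.
    """
    return {probe: values for probe, values in expr.items()
            if variance(values) > cutoff}


def average_variance(expr):
    """Crude stand-in for a pooled variance: the mean per-probeset variance."""
    return mean([variance(values) for values in expr.values()])


# Hypothetical log2 expression values, one list per probeset over six arrays.
expr = {
    "probe_flat": [12.1, 12.0, 12.1, 12.0, 12.1, 12.0],   # high, unchanged
    "probe_changing": [7.0, 7.2, 6.9, 9.1, 9.3, 9.0],     # shifts between arrays
    "probe_noisy_low": [3.1, 4.0, 2.8, 3.6, 3.0, 3.9],    # low intensity, noisy
}

kept = filter_by_variance(expr, cutoff=0.05)  # cutoff is an arbitrary choice

before = average_variance(expr)
after = average_variance(kept)
```

Only the flat probeset is dropped here, and the average variance of 
what remains is larger than before the filter, which is the effect on 
the pooled estimate described above.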

But the same can be said for any filtering method. If you remove a lot 
of probesets of low intensity (say all those with an absent call) then 
you could very well be removing probesets with a higher variance, and 
so distort the estimate of the pooled variance as well.

As with all statistics, there are tradeoffs and assumptions being made 
regardless of what you do.

> 
> So, while I might be willing to filter out known "dead" probesets (that I 
> never see above detection threshold over many hundreds of assays), I'm in 
> the camp that the statistics are corrupt if you filter without regard to 
> its effect on multiplicity corrections.

I don't really know what you mean by 'detection limit'. Has someone 
published something somewhere that says a probeset with an expression 
value below X means the mRNA for that gene has not been detected?

I am not sure how the filtering step will affect multiplicity 
corrections. If one were to use the two-stage modeling procedure that 
you seem to think I am advocating, then of course the p-values 
themselves would be questionable, as assumptions would have been 
violated. But I don't know where multiplicity correction comes into the 
equation.

But personally I am not that much of a purist about multiplicity anyway. 
I have been known to select probesets based on adjusted p-value and a 
fold change criterion as well, which completely invalidates the meaning 
of the adjusted p-values.
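
The combined selection just described can be sketched roughly as 
follows, again in Python for self-containment. The adjustment is the 
Benjamini-Hochberg step-up procedure written out by hand; the result 
records, the 0.05 cutoff and the log2 fold change cutoff of 1 are 
hypothetical values chosen for illustration, not anything taken from a 
real analysis.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, carrying a running minimum so the
    # adjusted values stay monotone in the raw p-values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select(results, alpha, min_abs_logfc):
    """Return ids passing both the adjusted p-value and fold change cutoffs."""
    adj = bh_adjust([r["p"] for r in results])
    return [r["id"] for r, q in zip(results, adj)
            if q <= alpha and abs(r["logFC"]) >= min_abs_logfc]


# Hypothetical per-probeset results (log2 fold change and raw p-value).
results = [
    {"id": "probe_a", "logFC": 2.1, "p": 0.0002},
    {"id": "probe_b", "logFC": 0.3, "p": 0.0010},
    {"id": "probe_c", "logFC": -1.4, "p": 0.0150},
    {"id": "probe_d", "logFC": 1.2, "p": 0.2000},
    {"id": "probe_e", "logFC": 0.1, "p": 0.7000},
]

selected = select(results, alpha=0.05, min_abs_logfc=1.0)
```

In this toy list probe_b passes the adjusted p-value cutoff but is 
dropped by the fold change cutoff, so the final list is no longer the 
set that the adjusted p-values were computed to control.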

Best,

Jim



> 
> As an aside, it should be possible to fit some of the models using 
> truncated/censored distributions (wherein the statistical model gets to 
> know that there were X number of probesets with values < threshold, but 
> doesn't pretend that those values are real).  That's an idea for the model 
> developers to ponder ...
> 
> -Aaron
> 

-- 
James W. MacDonald, MS
Biostatistician
UMCCC cDNA and Affymetrix Core
University of Michigan
1500 E Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623


