[BioC] DESeq variance question

Sat Dec 3 20:36:10 CET 2011

Dear Steffen

On 2011-12-02 13:53, Steffen Priebe wrote:
> I was using DESeq (and edgeR) for differentially expression analysis.
> In my current dataset I compare 3 biological replicates of control vs. 3 biol. replicates from a mutant.
> The resulting 4 top genes according adjusted pvalue by DESeq and edgeR have a very high variance.
> (The reason for this is, that this are genes located on the chrY and only one replicate of the mutant was male)
>
> My question is now, how can genes with such a high variance of the counts result in this small pvalues?
> Is there any way to avoid this, because I think this are False Positives?
>
> Attached you can find the combined result table of DESeq and edgeR for the top 100 genes.
> The problem occurs for the first 4 genes. The raw counts are stated in columns P-U (P-R: Mutant, T-U Control).

Short answer: I suppose you used version 1.4.x of DESeq. In the new 
release (DESeq version 1.6.x), we made some major changes, which should 
cause the problem to disappear.

Longer answer: The data frame returned by 'nbinomTest' in the old 
version returned, next to the p values, two vectors of "variance 
residuals", labeled "resVarA" and "resVarB". The vignette explained that 
p values should be considered unreliable if the variance residuals were 
too large and advised to disregard such hits. Your Y chromosome genes 
certainly had such large values in resVarA or resVarB, and you should 
have removed them because of this.

These variance residuals are the ratio of the per-gene estimate of the 
variance (which is very imprecise in case of few samples) and the fitted 
value found from sharing data across genes (which is stable but may be 
misleading in case of genes which behave very different than the other 
genes of similar expression range.) Previously, we used only the fitted 
dispersion values for the test and left it to the user to filter out 
those hits for which the two values were in too much disagreement. Many 
users overlooked the need for this last step, others found the solution 
unsatisfactory as it turned out to be hard to advise on a good threshold 
for the filtering on variance residuals.

The new version solves the issue with a pragmatic and simple approach 
that works surprisingly well: DESeq now simply uses the maximum of the 
two values. See the updated vignette for more details on this topic.

This costs power but avoids the need for filtering. In our experience, 
the power cost is surprisingly low for typical data sets, which, in our 
view, justifies the use of such a simple method, at least for now.

You can switch back to the old behaviour, using the 'sharingMode' 
argument to the 'estimateDispersions' function. This can be useful to 
see how this 'maximum rule' influences your result.

EdgeR, with its empirical Bayesian approach (implemented in its function 
'estimateTagwiseDispersion') should typically give p values in the 
middle between DESeq's result using the 'maximum' and its the 
'fitted-only' sharing modes. However, at least in your case, edgeR 
seemed to have stayed too close to the fitted values (or: to the 'common 
dispersion', in edgeR's terminology) as you wrote it also gave you p 
values for your high-variance genes that you considered implausibly low.

   Simon