[BioC] DESeq variance question
Simon Anders
anders at embl.de
Sat Dec 3 20:36:10 CET 2011
Dear Steffen
On 2011-12-02 13:53, Steffen Priebe wrote:
> I was using DESeq (and edgeR) for differentially expression analysis.
> In my current dataset I compare 3 biological replicates of control vs. 3 biol. replicates from a mutant.
> The resulting 4 top genes according adjusted pvalue by DESeq and edgeR have a very high variance.
> (The reason for this is, that this are genes located on the chrY and only one replicate of the mutant was male)
>
> My question is now, how can genes with such a high variance of the counts result in this small pvalues?
> Is there any way to avoid this, because I think this are False Positives?
>
> Attached you can find the combined result table of DESeq and edgeR for the top 100 genes.
> The problem occurs for the first 4 genes. The raw counts are stated in columns P-U (P-R: Mutant, T-U Control).
Short answer: I suppose you used version 1.4.x of DESeq. In the new
release (DESeq version 1.6.x), we made some major changes, which should
cause the problem to disappear.
Longer answer: The data frame returned by 'nbinomTest' in the old
version returned, next to the p values, two vectors of "variance
residuals", labeled "resVarA" and "resVarB". The vignette explained that
p values should be considered unreliable if the variance residuals were
too large and advised to disregard such hits. Your Y chromosome genes
certainly had such large values in resVarA or resVarB, and you should
have removed them because of this.
These variance residuals are the ratio of the per-gene estimate of the
variance (which is very imprecise in case of few samples) and the fitted
value found from sharing data across genes (which is stable but may be
misleading in case of genes which behave very different than the other
genes of similar expression range.) Previously, we used only the fitted
dispersion values for the test and left it to the user to filter out
those hits for which the two values were in too much disagreement. Many
users overlooked the need for this last step, others found the solution
unsatisfactory as it turned out to be hard to advise on a good threshold
for the filtering on variance residuals.
The new version solves the issue with a pragmatic and simple approach
that works surprisingly well: DESeq now simply uses the maximum of the
two values. See the updated vignette for more details on this topic.
This costs power but avoids the need for filtering. In our experience,
the power cost is surprisingly low for typical data sets, which, in our
view, justifies the use of such a simple method, at least for now.
You can switch back to the old behaviour, using the 'sharingMode'
argument to the 'estimateDispersions' function. This can be useful to
see how this 'maximum rule' influences your result.
EdgeR, with its empirical Bayesian approach (implemented in its function
'estimateTagwiseDispersion') should typically give p values in the
middle between DESeq's result using the 'maximum' and its the
'fitted-only' sharing modes. However, at least in your case, edgeR
seemed to have stayed too close to the fitted values (or: to the 'common
dispersion', in edgeR's terminology) as you wrote it also gave you p
values for your high-variance genes that you considered implausibly low.
Simon
More information about the Bioconductor
mailing list