[BioC] dispersion in edgeR
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Jun 26 07:19:28 CEST 2010
Hi Naomi,
Thanks for your interesting questions about the edgeR model.
1) We assume that variability in RNA-seq counts comes from three
sources:
a) sampling variability associated with sequencing (for each lane),
b) technical variation in library preparation (lane to lane), and
c) biological variation
Sources (b) and (c) affect the underlying concentration of each transcript
in each RNA sample, whereas (a) affects the precision with which this
concentration is measured by sequencing technology. The dispersion
parameter in the edgeR model measures the squared coefficient of variation
(CV) of each transcript's concentration arising from sources (b) and (c).
Experiments suggest that variability from source (b) is relatively minor,
so the dispersion is essentially the squared CV of biological variation.
If you sequence deeply enough, you can theoretically eliminate variability
from source (a) almost entirely. In other words, you can determine almost
perfectly the concentration of each transcript in each sample. However,
you can't eliminate variability from sources (b) and (c) in this way. As you
sequence more and more deeply, power to detect differential expression is
eventually determined only by biological variation, hence the asymptote
that you mention. In the edgeR model, this is reflected by the fact that
observed transcript concentrations converge to gamma distributed random
variables with CV = sqrt(dispersion).
To further increase the power to detect differential expression, you would
need to reduce the impact of biological variability as well, and you can
only do that by increasing the number of biological replicates. This is
what the model predicts.
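
As a rough numerical illustration of point 1 (a sketch in base R, not
edgeR itself): under a negative binomial model with mean mu and
dispersion phi, the variance is mu + phi*mu^2, so the squared CV of a
count is 1/mu + phi. The 1/mu term is the sequencing noise from source
(a), which shrinks as depth grows, while phi stays. The dispersion of
0.04, the expected counts and the helper name total_cv are all
assumptions chosen for illustration only.

```r
# Total CV of a negative binomial count with mean mu and dispersion phi:
#   CV^2 = 1/mu + phi   (sequencing noise + variation in concentration)
total_cv <- function(mu, phi) sqrt(1 / mu + phi)

phi <- 0.04                    # assumed dispersion, i.e. biological CV = 0.2
mu  <- c(10, 100, 1000, 1e5)   # assumed expected counts at increasing depth

cv_theory <- total_cv(mu, phi)

# Check by simulation: gamma-distributed concentrations, Poisson sequencing
set.seed(1)
n <- 1e5
cv_sim <- sapply(mu, function(m) {
  conc <- rgamma(n, shape = 1 / phi, scale = m * phi)  # mean m, CV sqrt(phi)
  y <- rpois(n, conc)                                  # sequencing step
  sd(y) / mean(y)
})

print(round(rbind(mu, cv_theory, cv_sim), 3))
```

The theoretical CV falls from about 0.37 at mu = 10 towards
sqrt(phi) = 0.2 at mu = 1e5, and cannot go below that however deeply
each library is sequenced.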
2) When shrinking the dispersion estimates, the amount of shrinkage
depends on the precision with which the original value is estimated as
well as on the weight of the prior distribution. For a given number of
libraries, larger counts give more reliable estimates of the dispersion
than small counts. Hence dispersions for rare transcripts tend to be
shrunk more than dispersions for very abundant transcripts, and so the
shrinkage is not monotonic.
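
A toy example may make point 2 concrete. The weighted average below is
an assumed simplification, not the computation that
estimateTagwiseDisp actually performs, and the function shrink, the
"info" weights and all numbers are hypothetical. It only shows how
tag-specific precision lets two tags agree at one prior weight and
disagree at another.

```r
# Toy precision-weighted shrinkage towards a common dispersion.
# An assumed simplification for illustration, NOT edgeR's algorithm.
shrink <- function(raw, info, prior.n, common) {
  (info * raw + prior.n * common) / (info + prior.n)
}

common <- 0.1                            # hypothetical common dispersion
raw  <- c(rare = 0.50, abundant = 0.18)  # hypothetical raw tagwise estimates
info <- c(rare = 2,    abundant = 50)    # hypothetical precision of each estimate

d10 <- shrink(raw, info, prior.n = 10, common = common)
d30 <- shrink(raw, info, prior.n = 30, common = common)

print(rbind(raw, d10, d30))
```

With these numbers both tags have the same moderated value at
prior.n = 10, but at prior.n = 30 the rare tag has been pulled further
towards the common value than the abundant one, so a plot of d10
against d30 need not be monotone.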
Best
Gordon
------------ original message -------------
[BioC] dispersion in edgeR
Naomi Altman naomi at stat.psu.edu
Fri Jun 25 20:03:52 CEST 2010
I have 2 questions about dispersion in edgeR.
1) The model implies that as sequencing depth increases, the power
for testing differential expression comes to an asymptote. This seems
odd.
2) Usually when using a shrinkage estimator, values of the original
estimates shrink monotonely towards the common estimate. So, if one
plots the moderated values against one another for 2 values of the
shrinkage parameter, the plot should be monotone increasing. The
plot was too big to attach, but what I did was:
d10=estimateTagwiseDisp(d,prior.n=10)
d30=estimateTagwiseDisp(d,prior.n=30)
plot(d10$tagwise.dispersion,d30$tagwise.dispersion)
I have not included my particular set of data, as I am pretty sure we
see this for any set.
This plot seems to imply that 2 genes could have the same moderated
dispersion values at prior.n=10 and very different values at
prior.n=30. This is not due to my
--Naomi
Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348 (Statistics)
University Park, PA 16802-2111