[BioC] dispersion in edgeR

Sat Jun 26 07:19:28 CEST 2010

Hi Naomi,

Thanks for your interesting questions about the edgeR model.

1)  We assume that variability in RNA-seq counts comes from three 
sources:

a) sampling variability associated with sequencing (for each lane),
b) technical variation in library preparation (lane to lane), and
c) biological variation

Sources (b) and (c) affect the underlying concentration of each transcript 
in each RNA sample, whereas (a) affects the precision with which this 
concentration is measured by sequencing technology.  The dispersion 
parameter in the edgeR model measures the squared coefficient of variation 
(CV) of each transcript's concentration arising from sources (b) and (c). 
Experiments suggest that variability from source (b) is relatively minor, 
so the dispersion is essentially the squared CV of biological variation.
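
As a rough check of this decomposition, here is a small base-R simulation.
It is only a sketch, not edgeR code, and the values of mu and phi and all
variable names are made up for illustration.  Concentrations are drawn from
a gamma distribution with squared CV equal to phi, and counts are Poisson
given the concentration, so the squared CV of the counts should come out
close to 1/mu + phi.

```r
set.seed(1)
mu  <- 50      # assumed mean count for one transcript
phi <- 0.1     # assumed dispersion (squared CV from sources b and c)
n   <- 200000  # number of simulated libraries

# Sources (b) + (c): gamma-distributed concentration, mean mu, CV^2 = phi
lambda <- rgamma(n, shape = 1 / phi, scale = mu * phi)
# Source (a): Poisson sampling variability given the concentration
y <- rpois(n, lambda)

cv2_observed <- var(y) / mean(y)^2
cv2_expected <- 1 / mu + phi   # sampling term + dispersion term
print(c(observed = cv2_observed, expected = cv2_expected))
```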

If you sequence deeply enough, you can theoretically eliminate variability 
from source (a) almost entirely.  In other words, you can determine almost 
perfectly the concentration of each transcript in each sample.  However, 
you can't eliminate the variability from sources (b) and (c) in this way.  As you 
sequence more and more deeply, power to detect differential expression is 
eventually determined only by biological variation, hence the asymptote 
that you mention.  In the edgeR model, this is reflected by the fact that 
observed transcript concentrations converge to gamma distributed random 
variables with CV = sqrt(dispersion).
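
To put numbers on the asymptote, the sketch below tabulates the CV of the
observed concentration for one transcript as its expected count grows with
sequencing depth, using CV^2 = 1/mu + dispersion.  The dispersion of 0.1
and the grid of counts are assumptions chosen purely for illustration.

```r
phi <- 0.1                            # assumed dispersion
mu  <- c(1, 10, 100, 1000, 1e4, 1e6)  # expected count at increasing depth

cv_total <- sqrt(1 / mu + phi)  # sampling variability plus dispersion
cv_floor <- sqrt(phi)           # limit as depth goes to infinity

print(data.frame(mu = mu, cv_total = cv_total, cv_floor = cv_floor))
```

The cv_total column keeps falling as mu grows but never drops below
cv_floor, which is the asymptote in question.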

To further increase the power to detect differential expression, you would 
need to reduce the impact of biological variability as well, and you can 
only do that by increasing the number of biological replicates, which 
averages that variability out.  This is what the model predicts.
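
A back-of-envelope illustration of depth versus replication follows.  It
uses a delta-method approximation for the standard error of a log fold
change between two groups, sqrt(2 * (1/mu + phi) / n), which is an
assumption of this sketch rather than anything edgeR computes.  The values
of phi, mu and n are likewise invented.

```r
phi  <- 0.1                    # assumed dispersion
mu   <- c(10, 100, 1000, Inf)  # expected count per library; Inf = unlimited depth
nrep <- c(2, 4, 8, 16)         # biological replicates per group

# Approximate SE of the log fold change between two groups of nrep libraries
se_logfc <- outer(mu, nrep, function(m, n) sqrt(2 * (1 / m + phi) / n))
dimnames(se_logfc) <- list(paste0("mu=", mu), paste0("n=", nrep))
print(round(se_logfc, 3))
```

Reading down a column shows the gain from deeper sequencing levelling off
at the mu=Inf row, while reading across that row shows that only more
replicates lower the floor.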

2) When shrinking the dispersion estimates, the amount of shrinkage 
depends on the precision with which the original value is estimated as 
well as on the weight of the prior distribution.  For a given number of 
libraries, larger counts give more reliable estimates of the dispersion 
than small counts.  Hence dispersions for rare transcripts tend to be 
shrunk more than dispersions for very abundant transcripts, and so the 
shrinkage is not monotonic.  Two transcripts with different raw estimates 
and different abundances can coincide at one value of prior.n and differ 
at another.
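
A toy example may make this concrete.  The sketch below replaces edgeR's
weighted likelihood with a simple precision-weighted average of each
gene's raw dispersion and the common dispersion.  This is a deliberate
simplification, and the function shrink(), the precision weights and all
numbers are hypothetical.

```r
# Toy shrinkage: precision-weighted average of a raw dispersion and the
# common dispersion.  NOT edgeR's actual estimator; illustration only.
shrink <- function(raw, precision, prior.n, common = 0.1) {
  (precision * raw + prior.n * common) / (precision + prior.n)
}

# Gene A: abundant, so its raw estimate is precise (large weight)
# Gene B: rare, so its raw estimate is noisy (small weight)
raw       <- c(A = 0.50, B = 1.06)
precision <- c(A = 40,   B = 5)

d10 <- shrink(raw, precision, prior.n = 10)
d30 <- shrink(raw, precision, prior.n = 30)
print(rbind(prior.n10 = d10, prior.n30 = d30))
```

In this toy setting the two genes have the same moderated value at
prior.n=10 but different values at prior.n=30, because the rare gene is
pulled towards the common dispersion faster than the abundant one.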

Best
Gordon

------------ original message -------------
[BioC] dispersion in edgeR
Naomi Altman naomi at stat.psu.edu
Fri Jun 25 20:03:52 CEST 2010

I have 2 questions about dispersion in edgeR.

1) The model implies that as sequencing depth increases, the power
for testing differential expression comes to an asymptote.  This seems 
odd.

2) Usually when using a shrinkage estimator, values of the original
estimates shrink monotonically towards the common estimate.  So, if one
plots the moderated values against one another for 2 values of the
shrinkage parameter, the plot should be monotone increasing.  The
plot was too big to attach, but what I did was:

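# Moderate the tagwise dispersions with two different prior weights
d10 <- estimateTagwiseDisp(d, prior.n=10)
d30 <- estimateTagwiseDisp(d, prior.n=30)
# Compare the moderated dispersions gene by gene
plot(d10$tagwise.dispersion, d30$tagwise.dispersion)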

I have not included my particular set of data, as I am pretty sure we
see this for any set.

This plot seems to imply that 2 genes could have the same moderated
dispersion values at prior.n=10 and very different values at
prior.n=30.  This is not due to my

--Naomi

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111
