[BioC] A few Q's on using DEXSeq with mucho data

Simon Anders anders at embl.de
Sun Mar 11 12:04:06 CET 2012

Hi Steve

On 2012-03-08 19:31, Steve Lianoglou wrote:
[...]
> I was trying to convey my concern (maybe wrong) that the "fitted
> dispersion" we end up using is a function of the mean read count for
> the bin, where the mean is calculated from the expression of the gene
> across all samples/conditions.
>
> It could be that in one particular two-experiment comparison I want to
> make, the expression of the exon is quite high in both samples. In
> this case, the higher "averaged normalized count value" (x-axis of fig
> 2 in your pre-paper) would likely be associated w/ a lower dispersion
> when doing the test therefore increasing our power.
>
> It could be, however, that in the rest of the conditions the gene (and
> therefore the exon) would be expressed at a lower level, and the
> dispersion for that bin would then be estimated at a higher amount,
> decreasing the power in this case.
[...]

You are right, this could be an argument in favour of subsetting before
dispersion estimation. I am not quite sure how important this effect is
in practice, though.

Bear in mind that the dispersion does not contain the Poisson noise. For
mean µ and dispersion α, the variance is v = µ + α µ², and the squared
coefficient of variation (CV) is CV² = v/µ² = 1/µ + α.

Hence, the dominant term in the dependence of the variance, and hence of
the power, on the mean is the Poisson term 1/µ, and not so much any
remaining dependence of α on µ. A negative binomial generalized linear
model takes this into account: it uses the mean-variance relation
v = µ + α µ², with α, not v, considered constant across the model,
precisely because this handles well the effect that differences in
overall mean between the treatment groups have on the variance.
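
To make the decomposition CV² = 1/µ + α concrete, here is a minimal
Python sketch. It is not DEXSeq code: the function names and the
dispersion value α = 0.1 are assumptions chosen purely for illustration.
It holds α fixed and shows how the Poisson term 1/µ alone changes CV²
as the mean varies.

```python
# Illustrative sketch only: alpha = 0.1 is an assumed dispersion value,
# and the function names below are hypothetical, not part of DEXSeq.

def nb_variance(mu, alpha):
    """Negative binomial variance: v = mu + alpha * mu^2."""
    return mu + alpha * mu ** 2


def cv_squared(mu, alpha):
    """Squared coefficient of variation: v / mu^2 = 1/mu + alpha."""
    return nb_variance(mu, alpha) / mu ** 2


alpha = 0.1  # assumed constant across the model, as in the NB GLM

for mu in (1, 10, 100, 1000):
    poisson_term = 1.0 / mu          # shot-noise contribution to CV^2
    total = cv_squared(mu, alpha)    # Poisson term plus dispersion
    print(f"mu={mu:5d}  1/mu={poisson_term:.4f}  alpha={alpha:.4f}  CV^2={total:.4f}")
```

With α held constant, all of the change in CV² across the rows comes
from the 1/µ column, which is the point made above about a model that
fixes α rather than v.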

Nevertheless, as not only CV² but also α itself seems to decrease with
µ, this is not perfect.

Simon