[BioC] DESeq: residuals instead of read counts
Simon Anders
anders at embl.de
Mon Jan 17 11:54:42 CET 2011
Dear Tuuli
On 01/16/2011 10:30 AM, Tuuli Lappalainen wrote:
> I've been using DESeq for RNASeq data analysis of differential
> expression of genes and exons. However, I have a dilemma: Instead of
> using read count data, I would like to use residuals of linear
> regression (after rescaling), but I'm unsure whether this is a valid
> thing to do.
>
> The reason to use residuals is the correction for variation in insert
> size between lanes (as e.g. in Montgomery et al. Nature 2010) - or
> potentially some other confounders as well. I have been planning to
> rescale the residuals to have all the values >0, and the median across
> the samples per gene the same as in the original count data, and use
> these data then similarly to raw counts. Would this be a proper way to
> analyze the data?
No, this would not work well because it invalidates the shot noise
estimation. The proper place to introduce such corrections would be
alongside the size factors. Unfortunately, DESeq does not offer, at the
moment, the functionality to enter such extra factors, but it seems I
really should add this, as it is actually trivial.
The alternative would be to use GLMs.
Maybe a few notes about both possibilities.
(a) DESeq models the counts K_ij of gene i in sample j as following a
negative-binomial distribution with mean s_j q_ij. Here, q_ij is the
expected concentration of transcripts from gene i under the condition
(treatment) of sample j, i.e., q_ij is the same for all samples j within
a replicate group. But even within a replicate group, the exepctation
values of the counts K_ij will be different -- primarily because the
sequencing depth varies from sample to sample. Hence, we multiply q_ij
with the "size factor" s_j, which models the relation between
(biological) transcript concentration and (technical) read-out in terms
of read counts.
At the moment, this factor s_j only depends on the sample j but not on
the gene i. If your confounder also only depends on j and not i, you can
simply multiply it to the size factor (provided, you have estimated the
confounder as a multiplicative rather than an additive correction).
If it depends on i, too, the math says all the same -- only, there is no
slot in the CountDataSet object to store this information. Maybe I
should add this, as you are not the first one to ask. Other users wanted
to incorportate a GC correction (as used in the Pickrell et al. paper),
and this would be done the same way.
(b) In the last release, DESeq got the functionality to fit GLMs -- only
this is not yet mentioned in the vignette but it is documented in the
help pages. With this, you can do a per-gene confounder estimation and
the main effect estimation in a unified fashion.
In brief, it works as follows:
- When calling 'newCountDataSet', the second argument is no longer a
factor of conditions but rather a data frame with the predictors, one
row for each column in the count table, and one column for each factor.
(in principle, some of the predictors could be numerical, rather than
factors.)
- When calling 'estimateVarianceFunctions', use the option
'method="pooled"'. This estimates one pooled variance for all samples.
- Instead of 'nbinomTest', use 'nbinomFitGLM' and 'nbinomTestGLM', as
follows (assuming that the model frame contains two columns,
'confounder' and 'treatment'):
# Fit the reduced model (takes a few minutes):
fit0 <- nbinomFitGLM( cds, count ~ confounder )
# Fit the full model (takes a few minutes):
fit1 <- nbinomFitGLM( cds, count ~ confounder + treatment )
# Calculate the pvalues by a simple chi2 test:
pvals <- nbinomGLMTest( fit1, fit0 )
This make sense only if you expect your confounder to change in value
for each gene. If it doesn't it should go into the size factors (or,
more generally, into the GLM's offset.)
The catch is that the new "pooled" variance estimation only works if the
design has "cell replicates", i.e., if the design matrix has replicated
rows, and this is often not the case. The authors of the "edgeR" package
found an ingenious way around the problem (their newest version uses
what they call a Cox-Reid dispersion estimation), and I guess, I should
take over this idea.
Cheers
Simon
More information about the Bioconductor
mailing list