[BioC] working with two factors in DESeq to analyze RNA-seq
Simon Anders
anders at embl.de
Wed Feb 9 11:53:45 CET 2011
Dear Fernando
On 02/08/2011 06:02 PM, Biase, Fernando wrote:
> Dear Dr Anders,
>
> Based on my readings I am willing to use DESeq to analyze my RNA-seq
> data. I found it very easy to follow your instructions written on the
> package vignette, and to work with it for one factor comparison.
> However, I am not sure if I am using it correctly to account for a
> second factor in the model. So, here are my two questions:
>
> How do I compare two groups, accounting for a second factor in the
> model? (it is not a factorial model that I am looking for)
>
> Can I simply write: The condition names as "Ax", "Ax", "Ax", "Ax",
> "Ay", "Ay", "Ay", "Bx", "Bx", "Bx", "By", "By", "By"
>
> nbinomTest ( cds, “A”, “B”)
>
> being A/B may main factor and x/y my secondary factor?
[...]
Not quite. You need to use the new GLM functionality that I added to
DESeq in the newest release (current version: 1.2.1).
First, when calling 'newCountDataSet', pass for 'condition' not a vector
but a data frame with the design, i.e., something like this:
design <- data.frame(
treatment = c( "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B" ),
block = c( "x", "x", "x", "x", "y", "y", "y",
"x", "x", "x", "y", "y", "y" ) )
cds <- newCountDataSet( countTable, design )
Then, estimate the size factors and the variance functions, the later
using the new "pooled" mode:
cds <- estimateSizeFactors( cds)
cds <- estimateVarianceFunctions( cds, method="pooled" )
Now, you can fit two models, a null model, which only account for your
blocking factor, and the full model, which also takes into account your
factor of interest:
fit0 <- nbinomFitGLM( cds, count ~ block )
fit1 <- nbinomFitGLM( cds, count ~ block + treatment )
The fitting might take a few minutes. Once it is finished, you get your
p values with a chi-squared test:
pvals <- nbinomGLMTest( fit1, fit0 )
These p values inform you whether the treatment has a significant effect
on the counts, after the effect of the blocking factor has been taken
into account.
Remember to adjust the p values for multiple testing before you look for
hits:
padj <- p.adjust( pvals, method="BH" )
If your count table had gene symbols in the row names, you can now list
your hits:
rownames(counts(cds))[ padj < .1 ]
This new functionality is not yet properly documented in the vignette
because I have not yet found a public data set that would be suitable to
demonstrate it.
Hence, as it has not been used that widely, I'd be interested to hear
how well it works for you.
Best regards
Simon
More information about the Bioconductor
mailing list