[BioC] How to use DESeq to normalize and estimate variance in a RNAseq timecourse analysis

Fri May 11 11:31:13 CEST 2012

Dear Simon,

>> In our dataset, we tried both procedures and we do see a difference in
>> the DESeq output. Maybe, as you said, the estimation of dispersion is
>> the same for both procedures, but the normalization step (estimation of
>> size factors) gives different outputs when using complete or partial
>> tables (with only a subset of the samples)?
>
> yes, this could be, but I'd be surprised if it made much of a 
> difference in the test outcomes. If you are still worried about the 
> issue, maybe post some details. What size factors do you get using 
> only one time point at a time and what do you get using all of them 
> together? Can you find an example for a gene where you see an 
> appreciable difference in the p value? If so, are the dispersion 
> estimates the same?

As you suggested, I have pasted an example below.

Here is an example from our data (the time points after treatments are 
T1, T2, T3, T4, T5, the three controls are Ctrl1, Ctrl2, Ctrl3).

The size factors estimated on the complete table are the following:

Ctrl1 0.811399035473249
Ctrl2 0.858304900598826
Ctrl3 0.959802357106788
T1 0.947672016144435
T2 1.05315240155981
T3 1.13022212977686
T4 1.22731615452888
T5 1.19028477928069
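
For context, DESeq's size factors follow a median-of-ratios scheme: each sample is compared against a per-gene geometric mean taken over all samples in the table, so the reference itself changes when samples are dropped. Below is a minimal Python sketch of that idea on made-up counts. It is not our data and not DESeq's actual code; the function name `size_factors` and the `toy` counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of per-gene counts
    (all lists in the same gene order).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Per-gene log geometric mean across the samples in THIS table.
    # Genes with a zero count in any sample are skipped.
    log_geomeans = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geomeans.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geomeans.append(None)

    # Size factor of a sample = median over genes of count / geometric mean.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lg
            for g, lg in enumerate(log_geomeans)
            if lg is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Hypothetical toy counts: four genes, four samples.
toy = {
    "Ctrl1": [80, 160, 40, 400],
    "Ctrl2": [100, 200, 50, 500],
    "T4": [150, 300, 75, 750],
    "T5": [120, 240, 60, 600],
}

full = size_factors(toy)
subset = size_factors({s: toy[s] for s in ("Ctrl1", "Ctrl2", "T5")})

# Same samples, different reference: print both values and their ratio.
for s in subset:
    print(s, round(full[s], 4), round(subset[s], 4), round(subset[s] / full[s], 4))
```

In this idealized toy case the samples differ only in depth, so dropping a sample rescales every remaining size factor by the same amount. Real data will not behave this cleanly.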

The size factors estimated on a partial table (restricted to Controls + 
T5) are the following:

Ctrl1 0.868784756382365
Ctrl2 0.918880737221278
Ctrl3 1.020617176156
T5 1.2627166738945
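
One way to compare the two sets of size factors is to look at the per-sample ratio (partial / complete) for the four samples present in both tables. If the two sets differed only by a common rescaling, the ratio would be the same for every sample, and a common factor would cancel out of the fold changes. A small Python sketch using the values pasted above (the variable names are mine):

```python
# Size factors copied from the two lists above (shared samples only).
complete = {
    "Ctrl1": 0.811399035473249,
    "Ctrl2": 0.858304900598826,
    "Ctrl3": 0.959802357106788,
    "T5": 1.19028477928069,
}
partial = {
    "Ctrl1": 0.868784756382365,
    "Ctrl2": 0.918880737221278,
    "Ctrl3": 1.020617176156,
    "T5": 1.2627166738945,
}

# Per-sample ratio of the partial-table to the complete-table size factor.
ratios = {s: partial[s] / complete[s] for s in partial}
for s, r in ratios.items():
    print(f"{s}: {r:.4f}")

# How far the ratios are from being one common rescaling factor.
spread = max(ratios.values()) / min(ratios.values())
print(f"largest / smallest ratio: {spread:.4f}")
```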

As you can see, they seem to be quite different. This seems to translate 
into different numbers of significant genes (between Ctrl and T5) in the 
two cases: 2755 genes with padj<0.001 when the complete table is used, 
and 2976 genes with padj<0.001 when the partial table is used. 
Furthermore, the lists do not overlap completely:

(rows: padj<0.001 with the complete table; columns: padj<0.001 with the 
partial table)

         FALSE  TRUE
   FALSE 18135   303
   TRUE     82  2673
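
The totals quoted above can be recovered from this table by reading the rows as the complete table and the columns as the partial table, which is the orientation that reproduces 2755 and 2976. A short Python sketch of that check, with the shared fraction of the two lists added as one possible overlap summary (the names are illustrative):

```python
# Counts copied from the table above.
# Key = (significant with complete table, significant with partial table).
table = {
    (False, False): 18135,
    (False, True): 303,
    (True, False): 82,
    (True, True): 2673,
}

# Row and column totals for the "significant" category.
sig_complete = sum(n for (c, p), n in table.items() if c)
sig_partial = sum(n for (c, p), n in table.items() if p)

# Genes called significant in both analyses, and in at least one.
shared = table[(True, True)]
either = sig_complete + sig_partial - shared
total = sum(table.values())

print("significant, complete table:", sig_complete)
print("significant, partial table: ", sig_partial)
print("shared / either:", round(shared / either, 4))
print("genes tested:", total)
```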

We picked two genes at random (gene A and gene B) and show below the DESeq 
results comparing Ctrls and T5, after normalizing using the partial or 
the complete table:

table     id     baseMean     baseMeanA    baseMeanB    foldChange   log2FoldChange  pval         padj
Partial   geneA  1129.345865  965.1611989  1621.899863  1.680444536  0.748842926     0.000170905  1.19E-03
Complete  geneA  1203.113666  1030.619339  1720.596647  1.669478324  0.739397362     8.74E-06     9.09E-05

table     id     baseMean     baseMeanA    baseMeanB    foldChange   log2FoldChange  pval         padj
Partial   geneB  16.32456228  3.55138732   54.64408717  15.38668758  3.943610779     4.28E-05     3.47E-04
Complete  geneB  17.33523065  3.79053399   57.96932062  15.29318053  3.934816571     0.001910755  1.06E-02
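
As a consistency check on the numbers above, the fold changes can be recomputed as baseMeanB / baseMeanA and set side by side for the two normalizations. The Python sketch below uses only the values pasted in this message (the variable names are mine). It makes it easier to see that the fold changes barely move between the partial and complete tables, while the p values do.

```python
import math

# (gene, table) -> (baseMeanA, baseMeanB, foldChange, log2FoldChange, pval)
rows = {
    ("geneA", "partial"): (965.1611989, 1621.899863, 1.680444536, 0.748842926, 0.000170905),
    ("geneA", "complete"): (1030.619339, 1720.596647, 1.669478324, 0.739397362, 8.74e-06),
    ("geneB", "partial"): (3.55138732, 54.64408717, 15.38668758, 3.943610779, 4.28e-05),
    ("geneB", "complete"): (3.79053399, 57.96932062, 15.29318053, 3.934816571, 0.001910755),
}

# Recompute the fold change and its log2 from the two condition means.
recomputed = {}
for (gene, table), (mean_a, mean_b, fc, lfc, pval) in rows.items():
    fc_check = mean_b / mean_a
    recomputed[(gene, table)] = (fc_check, math.log2(fc_check))
    print(gene, table, round(fc_check, 6), round(math.log2(fc_check), 6), pval)

# Relative change in fold change between the two normalizations, per gene.
for gene in ("geneA", "geneB"):
    fc_partial = recomputed[(gene, "partial")][0]
    fc_complete = recomputed[(gene, "complete")][0]
    print(gene, "foldChange complete / partial:", round(fc_complete / fc_partial, 4))
```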

I hope these results are detailed enough to help you understand our 
problem. If not, please do not hesitate to ask for more information.

Thanks a lot again for your help!

Marie