[BioC] DESeq Normalization Question

Sat May 11 02:39:15 CEST 2013

Roughly speaking, the size factors are calculated such that the MA plot 
between any two samples will be centered vertically on zero logFC. If 
you have just a few genes with high abundance in just a few samples, 
then the counts for all other genes in those samples will be depressed 
because the high-abundance genes will be taking a larger fraction of 
the more-or-less fixed number of total reads in the sample. Hence, to 
normalize the counts in these samples to others, they would have to be 
scaled up.

One example where this could happen is if you sequenced healthy cells 
and virus-infected cells, where a majority of mRNA might be viral.

On Fri 10 May 2013 09:38:30 AM PDT, Stephen Turner wrote:
> Simon et al.,
>
> I'm sure this issue has come up before, but I couldn't find an
> appropriate thread or answer either here or SEQanswers.
>
> What feature of the data or the distribution of counts among my
> samples can cause the sizeFactors to vary much more than the raw
> counts / library sizes?
>
> More detail: I'm using DESeq to analyze RNA-seq data mapped with STAR,
> counted with htseq-count. Comparing the "doubleTerm" samples to the
> "wt" samples, there are many genes that appear downregulated. While
> these samples were sequenced, on average, to a similar sequencing
> depth, the normalization factors are much smaller for WT, resulting in
> much larger normalized counts, resulting in more apparently
> downregulated genes in doubleTerm vs WT.
>
>> cds <- newCountDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory)
>> cds <- estimateSizeFactors(cds)
>> cds <- estimateDispersions(cds)
>> data.frame(sizefactors=sizeFactors(cds), rawcounts=colSums(counts(cds, normalized=FALSE)))
>                  sizefactors rawcounts
> S01_wt1           0.9016089  23466349
> S02_wt2           0.7679168  22428603
> S03_wt3           0.7952564  19841959
> S04_wt4           0.7839629  18363384
> S05_pten8w1       1.0301769  20859853
> S06_pten8w2       0.9949514  16809588
> S07_pten8w3       0.9425865  16731071
> S08_pten22w1      1.0826846  18906329
> S09_pten22w2      1.1640354  20164026
> S10_pten22w3      1.0111748  17306468
> S11_double8w1     0.7949001  17671986
> S12_double8w2     1.4509978  23673557
> S13_double8w3     1.1703853  22127841
> S14_doubleterm2   1.0786455  19063010
> S15_doubleterm4   1.1265935  19279814
> S16_doubleterm6   1.3059472  22750403
>
> Thank you.
>
> Stephen
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets
> methods   base
>
> other attached packages:
> [1] DESeq_1.12.0         lattice_0.20-15      locfit_1.5-9
> Biobase_2.20.0
> [5] BiocGenerics_0.6.0   edgeR_3.2.3          limma_3.16.2
> BiocInstaller_1.10.1
>
> loaded via a namespace (and not attached):
>   [1] annotate_1.38.0      AnnotationDbi_1.22.3 DBI_0.2-6
> DESeq2_1.0.9
>   [5] genefilter_1.42.0    geneplotter_1.38.0   GenomicRanges_1.12.2
> grid_3.0.0
>   [9] IRanges_1.18.0       RColorBrewer_1.0-5   RSQLite_0.11.3
> splines_3.0.0
> [13] stats4_3.0.0         survival_2.37-4      tools_3.0.0
> XML_3.95-0.2
> [17] xtable_1.7-1
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor