[BioC] Large logCPM values when using offsets in edgeR for normalization

Ying Wu daiyingw at gmail.com
Mon Mar 18 18:31:12 CET 2013


Dear list,
I am interested in adjusting for GC bias in my experiment. I have my
data in an "SeqExpressionSet" object (using EDASeq library) called eset
and I then run the following commands:

dataOffset = withinLaneNormalization(eset,"gc", which="full",offset=TRUE)
dataOffset2 = betweenLaneNormalization(dataOffset, which="full",offset=TRUE)

y = DGEList(exprs(dataOffset), group = pData(eset)$conditions)
y = calcNormFactors(y,method="TMM")
y$design = model.matrix(~y$samples$group)
y = estimateGLMCommonDisp(y, y$design, offset = -offst(dataOffset2))
y = estimateGLMTagwiseDisp(y, y$design, offset = -offst(dataOffset2))
#no trended
lrt_offset <- glmLRT(glmFit(y,y$design, offset = -offst(dataOffset2)),
coef=2)

Where group is something like c(1,1,2,2). However, I get very high
logCPM values

> head(lrt_offset$table)
       logFC   logCPM       LR       PValue
1 -2.087455 24.90139 18.59469 1.616704e-05
2 -2.182279 24.71384 21.99644 2.731570e-06
3 -1.962187 25.17329 14.93835 1.110818e-04
4 -1.808961 24.67918 13.09471 2.961312e-04
5 -2.189829 26.08874 22.87277 1.730868e-06
6 -1.816757 25.37329 15.85988 6.820944e-05
> summary(lrt_offset$table$logCPM)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   22.27   24.76   25.32   25.39   25.95   30.58

Whereas before the glm command, the logCPM values look more like what I
am expecting

> summary(y$logCPM)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.032   4.068   4.716   4.778   5.369  10.260

Plotting y$logCPM andlrt_offset$table$logCPM however does not result in
a straight line (but it has high correlation). I've noticed in the
manual for both EDASeq and cqn (two packages for GC bias correction) the
logCPM values are also much higher (look under topTags output).

Could someone shed some insight into why these values are much higher
(statistically)? Computationally, I know that if offsets are not
provided, default offsets of log(lib.size) is used in the glmFit()
function. The code on the svn seems to have been modified quite recently
so maybe this will be resolved in the next update.

On a related note, I want to adjust for GC bias (with lane
normalization) and also for sequencing depth (between lane
normalization) but I want to do the latter in a simple way (and not
using the betweenLaneNormalization function above), could I do something
like this?

myOffset = -offst(dataOffset) + log(lib.size)

> head(offst(dataOffset), n=3)
           s1         s2         y1          y2
1  0.1870279  0.1033789  0.1605521  0.09555714
2 -0.3311078 -0.2439545 -0.1688206 -0.23481517
3  0.4720900  0.2818505  0.6052292  0.63641814
>log(lib.size)
[1] 14.94330 14.68804 13.10804 13.59071

Where dataOffset was generated above by doing full quantile
normalization using the withinLaneNormalization function above. I was
thinking about this because edgeR automatically generates an offset when
computing glmFit() to adjust for different library sizes (using
log(lib.size) as offset). Could I do something similar but also adjust
for GC biases and have normal looking logCPM values?

Lastly, what is the best way to get the 'adjusted' count values out of
edgeR? Is it by using the fitted values from glmFit() or using cpm() on
the non-fitted values ? would the latter adjust for offset?

Thanks,
Ying Wu

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
  [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
  [7] LC_PAPER=C                LC_NAME=C
  [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets methods   base

other attached packages:
  [1] edgeR_3.0.8          limma_3.14.4 EDASeq_1.4.0
  [4] R.oo_1.13.0          R.methodsS3_1.4.2 aroma.light_1.28.0
  [7] matrixStats_0.6.2    ShortRead_1.16.4 latticeExtra_0.6-24
[10] RColorBrewer_1.0-5   Rsamtools_1.10.2 lattice_0.20-10
[13] Biostrings_2.26.3    GenomicRanges_1.10.7 IRanges_1.16.6
[16] Biobase_2.18.0       BiocGenerics_0.4.0 knitr_1.1

loaded via a namespace (and not attached):
  [1] annotate_1.36.0      AnnotationDbi_1.20.6 bitops_1.0-5
  [4] DBI_0.2-5            DESeq_1.10.1 digest_0.6.3
  [7] evaluate_0.4.3       formatR_0.7 genefilter_1.40.0
[10] geneplotter_1.36.0   grid_2.15.0 hwriter_1.3
[13] markdown_0.5.4       parallel_2.15.0 RSQLite_0.11.2
[16] splines_2.15.0       stats4_2.15.0 stringr_0.6.2
[19] survival_2.37-4      tools_2.15.0 XML_3.95-0.2
[22] xtable_1.7-1         zlibbioc_1.4.0



More information about the Bioconductor mailing list