[BioC] Large logCPM values when using offsets in edgeR for normalization

Gordon K Smyth smyth at wehi.EDU.AU
Sat Mar 23 07:43:32 CET 2013


Dear Ying Wu,

As you have observed, the offset for glmFit() defaults to log(lib.size). 
If a value for offset is user-supplied, then it takes precedence over 
lib.size, and all calculations are done as if lib.size was equal to 
exp(offset).  Hence this affects directly the calculation of logCPM.

If you want to use an offset matrix with glmFit(), but you want the logCPM 
values to reflect the actual library sizes, then you should reset the mean 
value for offset to be the same as the mean value for log(lib.size):

   m <- mean(log(y$samples$lib.size))
   offset <- offset - mean(offset) + m
   fit <- glmFit(y, design, offset=offset)

etc.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth


>> From: Ying Wu <daiyingw at usc.edu>
>> Subject: Large logCPM values when using offsets in edgeR for normalization
>> Date: 21. März 2013 19:17:50 MEZ
>> Mon Mar 18 18:31:12 CET 2013

>>> Dear list,
>>> I am interested in adjusting for GC bias in my experiment. I have my
>>> data in an "SeqExpressionSet" object (using EDASeq library) called eset
>>> and I then run the following commands:
>>>
>>> dataOffset = withinLaneNormalization(eset,"gc", which="full",offset=TRUE)
>>> dataOffset2 = betweenLaneNormalization(dataOffset,
>>> which="full",offset=TRUE)
>>>
>>> y = DGEList(exprs(dataOffset), group = pData(eset)$conditions)
>>> y = calcNormFactors(y,method="TMM")
>>> y$design = model.matrix(~y$samples$group)
>>> y = estimateGLMCommonDisp(y, y$design, offset = -offst(dataOffset2))
>>> y = estimateGLMTagwiseDisp(y, y$design, offset = -offst(dataOffset2))
>>> #no trended
>>> lrt_offset <- glmLRT(glmFit(y,y$design, offset = -offst(dataOffset2)),
>>> coef=2)
>>>
>>> Where group is something like c(1,1,2,2). However, I get very high
>>> logCPM values
>>>
>>>> head(lrt_offset$table)
>>>       logFC   logCPM       LR       PValue
>>> 1 -2.087455 24.90139 18.59469 1.616704e-05
>>> 2 -2.182279 24.71384 21.99644 2.731570e-06
>>> 3 -1.962187 25.17329 14.93835 1.110818e-04
>>> 4 -1.808961 24.67918 13.09471 2.961312e-04
>>> 5 -2.189829 26.08874 22.87277 1.730868e-06
>>> 6 -1.816757 25.37329 15.85988 6.820944e-05
>>>> summary(lrt_offset$table$logCPM)
>>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>   22.27   24.76   25.32   25.39   25.95   30.58
>>>
>>> Whereas before the glm command, the logCPM values look more like what I
>>> am expecting
>>>
>>>> summary(y$logCPM)
>>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>   2.032   4.068   4.716   4.778   5.369  10.260
>>>
>>> Plotting y$logCPM and lrt_offset$table$logCPM however does not result in
>>> a straight line (but it has high correlation). I've noticed in the
>>> manual for both EDASeq and cqn (two packages for GC bias correction) the
>>> logCPM values are also much higher (look under topTags output).
>>>
>>> Could someone shed some insight into why these values are much higher
>>> (statistically)? Computationally, I know that if offsets are not
>>> provided, default offsets of log(lib.size) is used in the glmFit()
>>> function. The code on the svn seems to have been modified quite recently
>>> so maybe this will be resolved in the next update.
>>>
>>> On a related note, I want to adjust for GC bias (with lane
>>> normalization) and also for sequencing depth (between lane
>>> normalization) but I want to do the latter in a simple way (and not
>>> using the betweenLaneNormalization function above), could I do something
>>> like this?
>>>
>>> myOffset = -offst(dataOffset) + log(lib.size)
>>>
>>>> head(offst(dataOffset), n=3)
>>>           s1         s2         y1          y2
>>> 1  0.1870279  0.1033789  0.1605521  0.09555714
>>> 2 -0.3311078 -0.2439545 -0.1688206 -0.23481517
>>> 3  0.4720900  0.2818505  0.6052292  0.63641814
>>>> log(lib.size)
>>> [1] 14.94330 14.68804 13.10804 13.59071
>>>
>>> Where dataOffset was generated above by doing full quantile
>>> normalization using the withinLaneNormalization function above. I was
>>> thinking about this because edgeR automatically generates an offset when
>>> computing glmFit() to adjust for different library sizes (using
>>> log(lib.size) as offset). Could I do something similar but also adjust
>>> for GC biases and have normal looking logCPM values?
>>>
>>> Lastly, what is the best way to get the 'adjusted' count values out of
>>> edgeR? Is it by using the fitted values from glmFit() or using cpm() on
>>> the non-fitted values ? would the latter adjust for offset?
>>>
>>> Thanks,
>>> Ying Wu
>>>
>>>> sessionInfo()
>>> R version 2.15.0 (2012-03-30)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>>>  [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
>>>  [7] LC_PAPER=C                LC_NAME=C
>>>  [9] LC_ADDRESS=C              LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets methods   base
>>>
>>> other attached packages:
>>>  [1] edgeR_3.0.8          limma_3.14.4 EDASeq_1.4.0
>>>  [4] R.oo_1.13.0          R.methodsS3_1.4.2 aroma.light_1.28.0
>>>  [7] matrixStats_0.6.2    ShortRead_1.16.4 latticeExtra_0.6-24
>>> [10] RColorBrewer_1.0-5   Rsamtools_1.10.2 lattice_0.20-10
>>> [13] Biostrings_2.26.3    GenomicRanges_1.10.7 IRanges_1.16.6
>>> [16] Biobase_2.18.0       BiocGenerics_0.4.0 knitr_1.1
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] annotate_1.36.0      AnnotationDbi_1.20.6 bitops_1.0-5
>>>  [4] DBI_0.2-5            DESeq_1.10.1 digest_0.6.3
>>>  [7] evaluate_0.4.3       formatR_0.7 genefilter_1.40.0
>>> [10] geneplotter_1.36.0   grid_2.15.0 hwriter_1.3
>>> [13] markdown_0.5.4       parallel_2.15.0 RSQLite_0.11.2
>>> [16] splines_2.15.0       stats4_2.15.0 stringr_0.6.2
>>> [19] survival_2.37-4      tools_2.15.0 XML_3.95-0.2
>>> [22] xtable_1.7-1         zlibbioc_1.4.0
>>>
>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}


More information about the Bioconductor mailing list