[BioC] Remove batch effect in small RNASeq study (SVA or others?)

Tue Apr 29 04:02:53 CEST 2014

Hi Shirley,

I beleive that "normalized.lib.sizes = TRUE" has become the default 
when calling the cpm function on a DGEList, so you should not need to 
specify it.

-Ryan

On Mon Apr 28 18:59:02 2014, shirley zhang wrote:
>   Dear Dr. Smyth,
>
> Thank you very much for taking your time to look through my codes, and also
> provided an more approciate code sequence. Thank you!
>
> By following your codes, I think the batch effect was removed efficiently
> as shown in the attached figures. Do you agree?
>
> Furthermore, I found a very useful paper you published in Nature Protocol
> 2013.
> http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
>
> In the paper, you wrote:
> d = DGEList(counts = count, group = samples$condition)
> d = calcNormFactors(d)
> nc = cpm(d, normalized.lib.sizes = TRUE)
>
> My question is:
> In my case, should I add option "normalized.lib.sizes = TRUE" in cpm()
> after calling calcNormFactors() like the following:
> dge <- DGEList(counts=count)
> dge <- calcNormFactors(dge)
> logCPM <- cpm(dge,log=TRUE,prior.count=5, normalized.lib.sizes = TRUE)
>
> Many thanks for your help,
> Shirley
>
>
>
> On Mon, Apr 28, 2014 at 8:06 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> To add colours to the MDS plot:
>>
>>   plotMDS(logCPM, col=as.numeric(batch))
>>
>> Gordon
>>
>>
>> On Tue, 29 Apr 2014, Gordon K Smyth wrote:
>>
>>   Dear Shirley,
>>>
>>> I don't think that scaling in prcomp() is appropriate here.
>>>
>>> Better would be:
>>>
>>> library(edgeR)
>>> dge <- DGEList(counts=count)
>>> dge <- calcNormFactors(dge)
>>> logCPM <- cpm(dge,log=TRUE,prior.count=5)
>>> plotMDS(logCPM)
>>> logCPM <- removeBatchEffect(logCPM,batch=batch)
>>> plotMDS(logCPM)
>>>
>>> Best wishes
>>> Gordon
>>>
>>>
>>> On Mon, 28 Apr 2014, shirley zhang wrote:
>>>
>>>   Dear Dr. Smyth,
>>>>
>>>> Thanks for your reply. Here is the code sequence I used to remove the
>>>> batch
>>>> effect and to make the PCA plot.
>>>>
>>>> First,  getting raw counts for each feature per sample using htseq-count
>>>> (~64K features by using Ensemble.gtf)
>>>> Then, getting a count data.matrix by combing all samples together (64K
>>>> rows, and 14 columns). 8 samples from batch1, and 6 samples from batch2.
>>>>
>>>>   count = cbind(count.s1, count.s2, ...., count.s14)
>>>>>
>>>>
>>>> #remove the batch effect
>>>> library(edgeR)
>>>> batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>>>>
>>>> logCPM <- cpm(count,log=TRUE,prior.count=5)
>>>> logCPM <- removeBatchEffect(logCPM, batch=batch)
>>>>
>>>> #PCA
>>>> B.res = prcomp(logCPM, scale = TRUE)
>>>> pc.s = summary(.Bres)$importance[2,1:2]
>>>> pc1.var = round(pc.s[["PC1"]],2)
>>>> pc2.var = round(pc.s[["PC2"]],2)
>>>>
>>>> pdf(file = "all.count.logCPM.rmBatch.pca")
>>>> plot(B.res$rotation[,1:2], main = maintitle,  xlab = paste("PC1
>>>> (variance:
>>>> ",pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance:
>>>> ",pc2.var*100,
>>>> "%)", sep =""))
>>>> dev.off()
>>>>
>>>>   sessionInfo()
>>>>>
>>>> R version 3.0.2 (2013-09-25)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] edgeR_3.4.0  limma_3.18.7
>>>>
>>>>
>>>> Many thanks.
>>>> Shirley
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth <smyth at wehi.edu.au>
>>>> wrote:
>>>>
>>>>
>>>>>   Date: Sun, 27 Apr 2014 20:32:55 -0400
>>>>>
>>>>>> From: shirley zhang <shirley0818 at gmail.com>
>>>>>> To: Gordon K Smyth <smyth at wehi.edu.au>
>>>>>> Cc: Bioconductor mailing list <bioconductor at r-project.org>
>>>>>> Subject: Re: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>>>>          others?)
>>>>>>
>>>>>> Dear Dr. Smyth,
>>>>>>
>>>>>> Thank you very much for your quick reply. I did as you suggested by
>>>>>> first
>>>>>> getting log CPM value, then call removeBatchEffect(). I found the PCA
>>>>>> figure looks better than before, but there is still a batch effect.
>>>>>>
>>>>>>
>>>>> I don't see how there could still be a batch effect.
>>>>>
>>>>> Please give the code sequence you used to remove the batch effect and to
>>>>> make the PCA plot.
>>>>>
>>>>>   I attached two PCA figures. One is based on log10(raw count) which is
>>>>>
>>>>>> before calling cpm() and removeBatchEffect(). Another one is after.
>>>>>>
>>>>>> Could you look at them and give me more suggestions. Will a quantile
>>>>>> normalization across samples be a good idea since CPM() is still a
>>>>>> normalization only within each sample??
>>>>>>
>>>>>>
>>>>> You should normalize the data before using removeBatchEffect(), and
>>>>> quantile is one possibility.
>>>>>
>>>>> Gordon
>>>>>
>>>>>   Thanks again for your help,
>>>>>
>>>>>> Shirley
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 27, 2014 at 6:54 PM, Gordon K Smyth <smyth at wehi.edu.au>
>>>>>> wrote:
>>>>>>
>>>>>>   Dear Shirley,
>>>>>>
>>>>>>>
>>>>>>> I would probably do it like this:
>>>>>>>
>>>>>>>    library(edgeR)
>>>>>>>    logCPM <- cpm(y,log=TRUE,prior.count=5)
>>>>>>>    logCPM <- removeBatchEffect(logCPM, batch=batch)
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Gordon
>>>>>>>
>>>>>>>   Date: Sat, 26 Apr 2014 10:51:23 -0400
>>>>>>>
>>>>>>>   From: shirley zhang <shirley0818 at gmail.com>
>>>>>>>> To: Bioconductor Mailing List <bioconductor at stat.math.ethz.ch>
>>>>>>>> Subject: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>>>>>>          others?)
>>>>>>>>
>>>>>>>> I have a RNASeq paired-end data from two batches (8 samples from
>>>>>>>> batch1,
>>>>>>>> and 7 samples from batch2). After alignment using TopHat2, then I got
>>>>>>>> count
>>>>>>>> using HTseq-count, and FPKM value via Cufflinks. A big batch effect
>>>>>>>> can
>>>>>>>> be
>>>>>>>> viewed in PCA using both log10(raw count) and log10(FPKM) value.
>>>>>>>>
>>>>>>>> I can NOT use the block factor in edgeR to remove batch effect since
>>>>>>>> I
>>>>>>>> need
>>>>>>>> to first obtain residuals after adjusting for batch effect, then test
>>>>>>>> the
>>>>>>>> residuals for hundreds of thousands of SNPs (eQTL analysis).
>>>>>>>>
>>>>>>>> My question is how to remove batch effect without using edgeR:
>>>>>>>>
>>>>>>>> 1. is SVA ok for such a small sample size (N=15)?
>>>>>>>> 2. If SVA does not work, any other suggestions?
>>>>>>>>
>>>>>>>> Many thanks,
>>>>>>>> Shirley
>>>>>>>>
>>>>>>>>
>>>>>>>   ____________________________________________________________
>>>>> __________
>>>>> The information in this email is confidential and intended solely for
>>>>> the
>>>>> addressee.
>>>>> You must not disclose, forward, print or use it without the permission
>>>>> of
>>>>> the sender.
>>>>> ______________________________________________________________________
>>>>>
>>>>>
>>>>
>>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor