[BioC] Remove batch effect in small RNASeq study (SVA or others?)

shirley zhang shirley0818 at gmail.com
Tue Apr 29 03:59:02 CEST 2014


 Dear Dr. Smyth,

Thank you very much for taking your time to look through my codes, and also
provided an more approciate code sequence. Thank you!

By following your codes, I think the batch effect was removed efficiently
as shown in the attached figures. Do you agree?

Furthermore, I found a very useful paper you published in Nature Protocol
2013.
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html

In the paper, you wrote:
d = DGEList(counts = count, group = samples$condition)
d = calcNormFactors(d)
nc = cpm(d, normalized.lib.sizes = TRUE)

My question is:
In my case, should I add option "normalized.lib.sizes = TRUE" in cpm()
after calling calcNormFactors() like the following:
dge <- DGEList(counts=count)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge,log=TRUE,prior.count=5, normalized.lib.sizes = TRUE)

Many thanks for your help,
Shirley



On Mon, Apr 28, 2014 at 8:06 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:

> To add colours to the MDS plot:
>
>  plotMDS(logCPM, col=as.numeric(batch))
>
> Gordon
>
>
> On Tue, 29 Apr 2014, Gordon K Smyth wrote:
>
>  Dear Shirley,
>>
>> I don't think that scaling in prcomp() is appropriate here.
>>
>> Better would be:
>>
>> library(edgeR)
>> dge <- DGEList(counts=count)
>> dge <- calcNormFactors(dge)
>> logCPM <- cpm(dge,log=TRUE,prior.count=5)
>> plotMDS(logCPM)
>> logCPM <- removeBatchEffect(logCPM,batch=batch)
>> plotMDS(logCPM)
>>
>> Best wishes
>> Gordon
>>
>>
>> On Mon, 28 Apr 2014, shirley zhang wrote:
>>
>>  Dear Dr. Smyth,
>>>
>>> Thanks for your reply. Here is the code sequence I used to remove the
>>> batch
>>> effect and to make the PCA plot.
>>>
>>> First,  getting raw counts for each feature per sample using htseq-count
>>> (~64K features by using Ensemble.gtf)
>>> Then, getting a count data.matrix by combing all samples together (64K
>>> rows, and 14 columns). 8 samples from batch1, and 6 samples from batch2.
>>>
>>>  count = cbind(count.s1, count.s2, ...., count.s14)
>>>>
>>>
>>> #remove the batch effect
>>> library(edgeR)
>>> batch = as.factor(c(rep("b1", 8), rep("b2", 6)))
>>>
>>> logCPM <- cpm(count,log=TRUE,prior.count=5)
>>> logCPM <- removeBatchEffect(logCPM, batch=batch)
>>>
>>> #PCA
>>> B.res = prcomp(logCPM, scale = TRUE)
>>> pc.s = summary(.Bres)$importance[2,1:2]
>>> pc1.var = round(pc.s[["PC1"]],2)
>>> pc2.var = round(pc.s[["PC2"]],2)
>>>
>>> pdf(file = "all.count.logCPM.rmBatch.pca")
>>> plot(B.res$rotation[,1:2], main = maintitle,  xlab = paste("PC1
>>> (variance:
>>> ",pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance:
>>> ",pc2.var*100,
>>> "%)", sep =""))
>>> dev.off()
>>>
>>>  sessionInfo()
>>>>
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] edgeR_3.4.0  limma_3.18.7
>>>
>>>
>>> Many thanks.
>>> Shirley
>>>
>>>
>>>
>>>
>>> On Mon, Apr 28, 2014 at 6:31 AM, Gordon K Smyth <smyth at wehi.edu.au>
>>> wrote:
>>>
>>>
>>>>  Date: Sun, 27 Apr 2014 20:32:55 -0400
>>>>
>>>>> From: shirley zhang <shirley0818 at gmail.com>
>>>>> To: Gordon K Smyth <smyth at wehi.edu.au>
>>>>> Cc: Bioconductor mailing list <bioconductor at r-project.org>
>>>>> Subject: Re: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>>>         others?)
>>>>>
>>>>> Dear Dr. Smyth,
>>>>>
>>>>> Thank you very much for your quick reply. I did as you suggested by
>>>>> first
>>>>> getting log CPM value, then call removeBatchEffect(). I found the PCA
>>>>> figure looks better than before, but there is still a batch effect.
>>>>>
>>>>>
>>>> I don't see how there could still be a batch effect.
>>>>
>>>> Please give the code sequence you used to remove the batch effect and to
>>>> make the PCA plot.
>>>>
>>>>  I attached two PCA figures. One is based on log10(raw count) which is
>>>>
>>>>> before calling cpm() and removeBatchEffect(). Another one is after.
>>>>>
>>>>> Could you look at them and give me more suggestions. Will a quantile
>>>>> normalization across samples be a good idea since CPM() is still a
>>>>> normalization only within each sample??
>>>>>
>>>>>
>>>> You should normalize the data before using removeBatchEffect(), and
>>>> quantile is one possibility.
>>>>
>>>> Gordon
>>>>
>>>>  Thanks again for your help,
>>>>
>>>>> Shirley
>>>>>
>>>>>
>>>>> On Sun, Apr 27, 2014 at 6:54 PM, Gordon K Smyth <smyth at wehi.edu.au>
>>>>> wrote:
>>>>>
>>>>>  Dear Shirley,
>>>>>
>>>>>>
>>>>>> I would probably do it like this:
>>>>>>
>>>>>>   library(edgeR)
>>>>>>   logCPM <- cpm(y,log=TRUE,prior.count=5)
>>>>>>   logCPM <- removeBatchEffect(logCPM, batch=batch)
>>>>>>
>>>>>> Best wishes
>>>>>> Gordon
>>>>>>
>>>>>>  Date: Sat, 26 Apr 2014 10:51:23 -0400
>>>>>>
>>>>>>  From: shirley zhang <shirley0818 at gmail.com>
>>>>>>> To: Bioconductor Mailing List <bioconductor at stat.math.ethz.ch>
>>>>>>> Subject: [BioC] Remove batch effect in small RNASeq study (SVA or
>>>>>>>         others?)
>>>>>>>
>>>>>>> I have a RNASeq paired-end data from two batches (8 samples from
>>>>>>> batch1,
>>>>>>> and 7 samples from batch2). After alignment using TopHat2, then I got
>>>>>>> count
>>>>>>> using HTseq-count, and FPKM value via Cufflinks. A big batch effect
>>>>>>> can
>>>>>>> be
>>>>>>> viewed in PCA using both log10(raw count) and log10(FPKM) value.
>>>>>>>
>>>>>>> I can NOT use the block factor in edgeR to remove batch effect since
>>>>>>> I
>>>>>>> need
>>>>>>> to first obtain residuals after adjusting for batch effect, then test
>>>>>>> the
>>>>>>> residuals for hundreds of thousands of SNPs (eQTL analysis).
>>>>>>>
>>>>>>> My question is how to remove batch effect without using edgeR:
>>>>>>>
>>>>>>> 1. is SVA ok for such a small sample size (N=15)?
>>>>>>> 2. If SVA does not work, any other suggestions?
>>>>>>>
>>>>>>> Many thanks,
>>>>>>> Shirley
>>>>>>>
>>>>>>>
>>>>>>  ____________________________________________________________
>>>> __________
>>>> The information in this email is confidential and intended solely for
>>>> the
>>>> addressee.
>>>> You must not disclose, forward, print or use it without the permission
>>>> of
>>>> the sender.
>>>> ______________________________________________________________________
>>>>
>>>>
>>>
>>
> ______________________________________________________________________
> The information in this email is confidential and intended solely for the
> addressee.
> You must not disclose, forward, print or use it without the permission of
> the sender.
> ______________________________________________________________________
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.plotMDS.pdf
Type: application/pdf
Size: 4598 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140428/5c459451/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.rmBatch.plotMDS.pdf
Type: application/pdf
Size: 4584 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140428/5c459451/attachment-0001.pdf>


More information about the Bioconductor mailing list