[BioC] Bioconductor Digest, Vol 124, Issue 5

Wed Jun 5 14:08:34 CEST 2013

David, 

ComBat assumes a normal-normal mixture when doing the adjustment. It should work very well if your log-data are fairly normally distributed (within each gene) and if your sample size is relatively large (say ~15-30). If you have a small sample size or if your data (within each gene) are not normally distributed then I would worry that ComBat might not be appropriate.

Also, there is the non-parametric prior approach to ComBat. It assumes normal data, but lets the empirical prior distribution be unspecified. Try this and see if the results are different--this will tell you if you are robust to the prior specification.

Hope this helps, 

Evan

On Jun 5, 2013, at 6:00 AM, <bioconductor-request at r-project.org>
 wrote:

> Message: 18
> Date: Tue, 4 Jun 2013 15:40:20 -0400
> From: "David O'Brien" <dobrie10 at jhmi.edu>
> To: bioconductor at r-project.org
> Subject: [BioC] Remove batch effects from RNA-seq data using edgeR and
> 	sva/ComBat
> Message-ID:
> 	<CAPRUzNDVHYpd1dSg5m=Ce_Ddw1Nm9fV1Cs3c3kXgMec16NZSGQ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> I'm trying to analyze an RNA-seq experiment where the PCA plot shows better
> clustering by the day the experiment was done rather than treatment type.
> Using edgeR to determine differentially expressed genes resulted in less
> than 5 genes with an FDR under 5%. Creating a GLM model to remove batch
> effects for day of experiment as stated in the edgeR manual resulted in 42
> genes with an FDR less than 5%. An improvement, but still not good. So I
> tried using ComBat and the result was 986 genes with an FDR under 5%.
> Looking at the GO enrichment, the differentially expressed genes seem to
> make sense, but since ComBat was developed for microarrays, I'm concerned
> that there may be some caveats with this approach that I'm missing. Looking
> at the top genes below, the log2 fold change is really low and generally
> this just seems too good to be true. So my question is: Are there any
> reasons why using ComBat with RNA-seq data is not legit? And if so, can you
> see any problems with the approach below?
> 
>          mean_control mean_treatment      logFC         pval         padj
> Gene5727     51.224797      45.371919 -0.1750427 3.474361e-08 0.0003224554
> Gene3059      8.998311       5.740828 -0.6483954 1.056473e-07 0.0003268376
> Gene11899    35.044302      39.027842  0.1553238 7.398559e-08 0.0003268376
> Gene11724     2.556712       3.684178  0.5270535 1.959058e-07 0.0003636404
> Gene12218    30.852989      23.702209 -0.3803888 1.908726e-07 0.0003636404
> 
> Gene4952     26.122068      30.466346  0.2219474 3.346424e-07 0.0005176360
> 
> 
> My code is below. I've attached a file, dge.Rdata, that contains the
> counts info that is output from readDGE, so you can have the initial
> counts info.
> 
> 
> require(edgeR)
> require(sva)
> source('code/annotate_edgeR.R')
> files = data.frame(files=c('counts.control0', 'counts.control1',
> 'counts.control2', 'counts.control3', 'counts.treatment0',
> 'counts.treatment1', 'counts.treatment2', 'counts.treatment3'),
>                   group=c('control', 'control', 'control', 'control',
> 'treatment', 'treatment', 'treatment', 'treatment'),
>                   day=rep(0:3,2)
> )
> labels <- paste0(files$group, files$day)
> dge <- readDGE(files=files, path='data/HTSeq/', labels=labels)
> rownames(dge$counts) <- paste0('Gene', 1:nrow(dge$counts)) #Change gene
> names to anonymize data
> ################################
> # save(dge, file='objs/dge.Rdata')
> #     SEE ATTACHED FILE     #
> ###############################
> 
> ##  filter out the no_feature etc. rows
> dge <- dge[1:(nrow(dge)-5), ]
> ##  This mitochondrial rRNA gene takes up a massive portion of my libraries
> dge <- dge[!rownames(dge)%in%'Gene13515', ]
> ##  filter out lowly expressed genes
> keep <- rowSums(cpm(dge) > 1) >= 3 ## gene has at least 3 columns where cpm
> is > 1
> dge <- dge[keep, ]
> ##  Recompute library sizes
> dge$samples$lib.size <- colSums(dge$counts)
> ##  Normalize for lib size
> dge <- calcNormFactors(dge)
> 
> ## ComBat
> mod <- model.matrix(~as.factor(group), data=dge$sample)
> mod0 <- model.matrix(~1, data=dge$sample)
> batch <- dge$sample$day
> 
> combat <- ComBat(dat=cpm(dge), batch=batch, mod=mod)
> pval_combat = f.pvalue(combat, mod, mod0)
> padj_combat = p.adjust(pval_combat, method="BH")
> mean_control <- rowMeans(combat[, 1:4])
> mean_treatment <- rowMeans(combat[, 5:8])
> logFC <- log2(mean_treatment/mean_control)
> 
> res <- data.frame(mean_control, mean_treatment, logFC, pval=pval_combat,
> padj=padj_combat)
> res <- res[order(res$padj), ]
> 
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C                 LC_NAME=C
> LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] biomaRt_2.16.0 sva_3.6.0      mgcv_1.7-23    corpcor_1.6.6
> edgeR_3.2.3    limma_3.16.4
> 
> loaded via a namespace (and not attached):
> [1] grid_3.0.1      lattice_0.20-15 Matrix_1.0-12   nlme_3.1-109
> RCurl_1.95-4.1  tools_3.0.1
> [7] XML_3.96-1.1