[BioC] Bioconductor Digest, Vol 124, Issue 5
Johnson, William Evan
wej at bu.edu
Wed Jun 5 14:08:34 CEST 2013
David,
ComBat assumes a normal-normal mixture when doing the adjustment. It should work very well if your log-data are fairly normally distributed (within each gene) and if your sample size is relatively large (say ~15-30). If you have a small sample size or if your data (within each gene) are not normally distributed then I would worry that ComBat might not be appropriate.
Also, there is the non-parametric prior approach to ComBat. It assumes normal data, but lets the empirical prior distribution be unspecified. Try this and see if the results are different--this will tell you if you are robust to the prior specification.
Hope this helps,
Evan
On Jun 5, 2013, at 6:00 AM, <bioconductor-request at r-project.org>
wrote:
> Message: 18
> Date: Tue, 4 Jun 2013 15:40:20 -0400
> From: "David O'Brien" <dobrie10 at jhmi.edu>
> To: bioconductor at r-project.org
> Subject: [BioC] Remove batch effects from RNA-seq data using edgeR and
> sva/ComBat
> Message-ID:
> <CAPRUzNDVHYpd1dSg5m=Ce_Ddw1Nm9fV1Cs3c3kXgMec16NZSGQ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I'm trying to analyze an RNA-seq experiment where the PCA plot shows better
> clustering by the day the experiment was done rather than treatment type.
> Using edgeR to determine differentially expressed genes resulted in less
> than 5 genes with an FDR under 5%. Creating a GLM model to remove batch
> effects for day of experiment as stated in the edgeR manual resulted in 42
> genes with an FDR less than 5%. An improvement, but still not good. So I
> tried using ComBat and the result was 986 genes with an FDR under 5%.
> Looking at the GO enrichment, the differentially expressed genes seem to
> make sense, but since ComBat was developed for microarrays, I'm concerned
> that there may be some caveats with this approach that I'm missing. Looking
> at the top genes below, the log2 fold change is really low and generally
> this just seems too good to be true. So my question is: Are there any
> reasons why using ComBat with RNA-seq data is not legit? And if so, can you
> see any problems with the approach below?
>
> mean_control mean_treatment logFC pval padj
> Gene5727 51.224797 45.371919 -0.1750427 3.474361e-08 0.0003224554
> Gene3059 8.998311 5.740828 -0.6483954 1.056473e-07 0.0003268376
> Gene11899 35.044302 39.027842 0.1553238 7.398559e-08 0.0003268376
> Gene11724 2.556712 3.684178 0.5270535 1.959058e-07 0.0003636404
> Gene12218 30.852989 23.702209 -0.3803888 1.908726e-07 0.0003636404
>
> Gene4952 26.122068 30.466346 0.2219474 3.346424e-07 0.0005176360
>
>
> My code is below. I've attached a file, dge.Rdata, that contains the
> counts info that is output from readDGE, so you can have the initial
> counts info.
>
>
> require(edgeR)
> require(sva)
> source('code/annotate_edgeR.R')
> files = data.frame(files=c('counts.control0', 'counts.control1',
> 'counts.control2', 'counts.control3', 'counts.treatment0',
> 'counts.treatment1', 'counts.treatment2', 'counts.treatment3'),
> group=c('control', 'control', 'control', 'control',
> 'treatment', 'treatment', 'treatment', 'treatment'),
> day=rep(0:3,2)
> )
> labels <- paste0(files$group, files$day)
> dge <- readDGE(files=files, path='data/HTSeq/', labels=labels)
> rownames(dge$counts) <- paste0('Gene', 1:nrow(dge$counts)) #Change gene
> names to anonymize data
> ################################
> # save(dge, file='objs/dge.Rdata')
> # SEE ATTACHED FILE #
> ###############################
>
> ## filter out the no_feature etc. rows
> dge <- dge[1:(nrow(dge)-5), ]
> ## This mitochondrial rRNA gene takes up a massive portion of my libraries
> dge <- dge[!rownames(dge)%in%'Gene13515', ]
> ## filter out lowly expressed genes
> keep <- rowSums(cpm(dge) > 1) >= 3 ## gene has at least 3 columns where cpm
> is > 1
> dge <- dge[keep, ]
> ## Recompute library sizes
> dge$samples$lib.size <- colSums(dge$counts)
> ## Normalize for lib size
> dge <- calcNormFactors(dge)
>
> ## ComBat
> mod <- model.matrix(~as.factor(group), data=dge$sample)
> mod0 <- model.matrix(~1, data=dge$sample)
> batch <- dge$sample$day
>
> combat <- ComBat(dat=cpm(dge), batch=batch, mod=mod)
> pval_combat = f.pvalue(combat, mod, mod0)
> padj_combat = p.adjust(pval_combat, method="BH")
> mean_control <- rowMeans(combat[, 1:4])
> mean_treatment <- rowMeans(combat[, 5:8])
> logFC <- log2(mean_treatment/mean_control)
>
> res <- data.frame(mean_control, mean_treatment, logFC, pval=pval_combat,
> padj=padj_combat)
> res <- res[order(res$padj), ]
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] biomaRt_2.16.0 sva_3.6.0 mgcv_1.7-23 corpcor_1.6.6
> edgeR_3.2.3 limma_3.16.4
>
> loaded via a namespace (and not attached):
> [1] grid_3.0.1 lattice_0.20-15 Matrix_1.0-12 nlme_3.1-109
> RCurl_1.95-4.1 tools_3.0.1
> [7] XML_3.96-1.1
More information about the Bioconductor
mailing list