[BioC] Correcting for batch effect between two batches of interleaved timepoints
Ryan C. Thompson
rct at thompsonclan.org
Fri Jul 12 21:37:36 CEST 2013
Hi all,
I am using edgeR to study an RNA-seq dataset of two human cell culture
types across four time points after treatment. The time points were done in
two batches of two time points each, several months apart, and the two
batches were done completely separately (i.e. all the wet lab was done
at different times, not just the library prep/sequencing). As a result,
there is a significant batch effect, clearly visible in an MDS plot of
edgeR-calculated logCPM values, and inter-batch contrasts always have
many more differentially expressed genes than intra-batch contrasts,
regardless of whether I use edgeR LRT, edgeR QLFT, or limma voom.
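For anyone who wants to reproduce the diagnosis, this is roughly how I look at it (a sketch, not my exact code; `y` is the DGEList and `batch`/`timepoint` are assumed per-sample factors):

```r
library(edgeR)
# MDS plot of logCPM values, coloring samples by batch and labeling by time
# point; the batch separation dominates the first dimension.
plotMDS(y, col = as.integer(batch), labels = as.character(timepoint))
```
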
So, normally I would just say that there's no hope for doing inter-batch
comparisons with this dataset, but there is one saving grace: the
batched time points are interleaved: batch A consists of time points 0
and 2, while batch B consists of time points 1 and 3. This, combined
with the nature of the experimental system, makes it extremely unlikely
that the batch effect coincides with any real biological pattern of gene
expression (it wouldn't make biological sense for many genes to go up,
then down, then up again). With this in mind, I tried a fairly simple
method to correct for the batch effect. First, I used edgeR to fit a
model with only the batch effect (i.e. model.matrix(~Batch)). Then, I
split the genes into 100 quantile bins by abundance and took the 20
lowest-dispersion genes from each bin, thus giving me a list of 2000
genes across the abundance spectrum with low intra-batch variability. I
then computed the library size normalization factors from only these
genes and used these normalization factors for the full dataset.
Essentially, I assume that low intra-batch variability implies low
inter-batch variability.
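In code, the selection step looks roughly like the following (a sketch with assumed names: `y` is the DGEList with a `Batch` column in `y$samples`; the exact dispersion-estimation calls may differ from what I ran):

```r
library(edgeR)
# Fit a batch-only model so tagwise dispersion measures intra-batch variability
design <- model.matrix(~Batch, data = y$samples)
y <- estimateDisp(y, design)

# Split genes into 100 quantile bins by average log-abundance
breaks <- quantile(y$AveLogCPM, probs = seq(0, 1, length.out = 101))
bins <- cut(y$AveLogCPM, breaks = breaks, include.lowest = TRUE)

# Take the 20 lowest-dispersion genes from each bin (2000 genes total)
sel <- unlist(lapply(split(seq_len(nrow(y)), bins), function(i) {
  i[order(y$tagwise.dispersion[i])[seq_len(min(20, length(i)))]]
}))

# Compute normalization factors from only these genes, keeping the
# original library sizes, and apply them to the full dataset
y$samples$norm.factors <-
  calcNormFactors(y$counts[sel, ], lib.size = y$samples$lib.size)
```
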
This procedure seems to have greatly reduced the batch effect, and has
allowed me to produce some useful results. This gives me confidence that I
have the right idea. However, I feel I could possibly do better if I
allowed for an abundance trend in the normalization factors (i.e. I
would compute an offset matrix rather than per-sample normalization
factors). Unfortunately, this is getting somewhat out of my depth, so I
figured I would ask here if anyone could suggest how to go about
computing such an offset matrix, or if there is some other easier method
I'm not thinking of that could solve my problem.
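To make the question concrete, here is the kind of thing I have in mind, though I have not validated it: for each sample, fit a loess trend of the sample-vs-reference log-ratio against abundance over the selected low-dispersion genes (`sel` from the step above), evaluate that trend at every gene, and add it to the log library sizes to form an offset matrix for glmFit. All names here are assumptions from my setup, not tested code:

```r
# Abundance-dependent offsets (sketch). Offsets are on the natural-log scale
# of effective library size, so the log2 trend is rescaled by log(2).
logcpm <- cpm(y, log = TRUE, prior.count = 2)
ref <- rowMeans(logcpm)                       # pseudo-reference sample
offset <- matrix(log(y$samples$lib.size),
                 nrow = nrow(y), ncol = ncol(y), byrow = TRUE)

for (j in seq_len(ncol(y))) {
  d <- data.frame(M = logcpm[sel, j] - ref[sel], A = y$AveLogCPM[sel])
  fit <- loess(M ~ A, data = d)
  trend <- predict(fit, newdata = data.frame(A = y$AveLogCPM))
  trend[is.na(trend)] <- 0                    # genes outside the fitted range
  offset[, j] <- offset[, j] + trend * log(2)
}

fit <- glmFit(y, design, offset = offset)     # offsets replace norm.factors
```
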
Any advice would be highly appreciated.
Data and code here: https://www.dropbox.com/sh/1w6fe5gz3hueayt/RJ84jWSsgT
The test data contains a design and a DGEList object containing all the
data, suitably anonymized. The DGEList already has norm factors and
dispersions pre-calculated. The code demonstrates my method of selecting
low-dispersion genes from 100 abundance bins and using them to
recalculate the normalization factors. It then plots the old norm
factors against the new ones, and the difference between them, to show
what effect the change has.
Thanks,
-Ryan Thompson