[BioC] Correcting for batch effect between two batches of interleaved timepoints
Ryan C. Thompson
rct at thompsonclan.org
Fri Jul 12 21:37:36 CEST 2013
Hi all,
I am using edgeR to study an RNA-seq dataset of two human cell culture
types across four time points after treatment. The time points were done in
two batches of two time points each, several months apart, and the two
batches were done completely separately (i.e. all the wet lab was done
at different times, not just the library prep/sequencing). As a result,
there is a significant batch effect, clearly visible in an MDS plot of
edgeR-calculated logCPM values, and inter-batch contrasts always have
many more differentially expressed genes than intra-batch contrasts,
regardless of whether I use edgeR LRT, edgeR QLFT, or limma voom.
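For anyone who wants to reproduce the diagnosis, this is roughly how I look at it (a sketch, not my exact code; `y` is the DGEList and `batch`/`timepoint` are assumed per-sample factors):

```r
library(edgeR)
# MDS plot of logCPM values, coloring samples by batch and labeling by time
# point; the batch separation dominates the first dimension.
plotMDS(y, col = as.integer(batch), labels = as.character(timepoint))
```
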
So, normally I would just say that there's no hope for doing inter-batch
comparisons with this dataset, but there is one saving grace: the
batched time points are interleaved: batch A consists of time points 0
and 2, while batch B consists of time points 1 and 3. This, combined
with the nature of the experimental system, makes it extremely unlikely
that the batch effect coincides with any real biological pattern of gene
expression (it wouldn't make biological sense for many genes to go up,
then down, then up again). With this in mind, I tried a fairly simple
method to correct for the batch effect. First, I used edgeR to fit a
model with only the batch effect (i.e. model.matrix(~Batch)). Then, I
split the genes into 100 quantile bins by abundance and took the 20
lowest-dispersion genes from each bin, thus giving me a list of 2000
genes across the abundance spectrum with low intra-batch variability. I
then computed the library size normalization factors from only these
genes and used these normalization factors for the full dataset.
Essentially, I assume that low intra-batch variability implies low
inter-batch variability.
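In code, the selection step looks roughly like the following (a sketch with assumed names: `y` is the DGEList with a `Batch` column in `y$samples`; the exact dispersion-estimation calls may differ from what I ran):

```r
library(edgeR)
# Fit a batch-only model so tagwise dispersion measures intra-batch variability
design <- model.matrix(~Batch, data = y$samples)
y <- estimateDisp(y, design)

# Split genes into 100 quantile bins by average log-abundance
breaks <- quantile(y$AveLogCPM, probs = seq(0, 1, length.out = 101))
bins <- cut(y$AveLogCPM, breaks = breaks, include.lowest = TRUE)

# Take the 20 lowest-dispersion genes from each bin (2000 genes total)
sel <- unlist(lapply(split(seq_len(nrow(y)), bins), function(i) {
  i[order(y$tagwise.dispersion[i])[seq_len(min(20, length(i)))]]
}))

# Compute normalization factors from only these genes, keeping the
# original library sizes, and apply them to the full dataset
y$samples$norm.factors <-
  calcNormFactors(y$counts[sel, ], lib.size = y$samples$lib.size)
```
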
This procedure seems to have greatly reduced the batch effect, and has
allowed me to produce some useful results. This gives me confidence that I
have the right idea. However, I feel I could possibly do better if I
allowed for an abundance trend in the normalization factors (i.e. I
would compute an offset matrix rather than per-sample normalization
factors). Unfortunately, this is getting somewhat out of my depth, so I
figured I would ask here if anyone could suggest how to go about
computing such an offset matrix, or if there is some other easier method
I'm not thinking of that could solve my problem.
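To make the question concrete, here is the kind of thing I have in mind, though I have not validated it: for each sample, fit a loess trend of the sample-vs-reference log-ratio against abundance over the selected low-dispersion genes (`sel` from the step above), evaluate that trend at every gene, and add it to the log library sizes to form an offset matrix for glmFit. All names here are assumptions from my setup, not tested code:

```r
# Abundance-dependent offsets (sketch). Offsets are on the natural-log scale
# of effective library size, so the log2 trend is rescaled by log(2).
logcpm <- cpm(y, log = TRUE, prior.count = 2)
ref <- rowMeans(logcpm)                       # pseudo-reference sample
offset <- matrix(log(y$samples$lib.size),
                 nrow = nrow(y), ncol = ncol(y), byrow = TRUE)

for (j in seq_len(ncol(y))) {
  d <- data.frame(M = logcpm[sel, j] - ref[sel], A = y$AveLogCPM[sel])
  fit <- loess(M ~ A, data = d)
  trend <- predict(fit, newdata = data.frame(A = y$AveLogCPM))
  trend[is.na(trend)] <- 0                    # genes outside the fitted range
  offset[, j] <- offset[, j] + trend * log(2)
}

fit <- glmFit(y, design, offset = offset)     # offsets replace norm.factors
```
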
Any advice would be highly appreciated.
Data and code here: https://www.dropbox.com/sh/1w6fe5gz3hueayt/RJ84jWSsgT
The test data contains a design and a DGEList object containing all the
data, suitably anonymized. The DGEList already has norm factors and
dispersions pre-calculated. The code demonstrates my method of selecting
low-dispersion genes from 100 abundance bins and using them to
recalculate the normalization factors. It then plots the old norm
factors against the new ones, and the difference between them, to show
what effect the change has.
Thanks,
-Ryan Thompson