[BioC] edgeR for data combined from different studies and/or platforms
anders at embl.de
Wed Mar 20 10:16:30 CET 2013
Thanks for the additional information.
Combining data from different studies is always quite risky. When you
see differences between your two cancer types, you will not be able to
say for sure whether they arise because the cancer types differ or
because the labs differ.
On the other hand, I agree with you that comparing the healthy samples
produced by the two labs is a reasonable approach to show that the lab
effects are small and well controlled. The only difficulty here is that
such a test can only detect lab effects that are large compared to the
differences between individuals within each data set.
In your actual comparison, on the other hand, you compare tumour and
healthy tissue from the same individual, which is a setting with much
more inferential power. So, a batch or lab effect might be large
compared to the noise in the paired comparisons and still be too small
to appear significant in the un-paired comparison of the healthy
samples.
On the third hand, however, the usual method of correcting for lab or
batch effects is to include a blocking factor in your linear model, and
this assumes that the batch effect is additive. Once you accept this
assumption (which may be questionable but is standard practice) you
don't need to account for batch effects at all in a paired comparison
between tumours and healthy tissue as long as both the tumour and the
control sample from each subject have always been processed in the same
lab (because then, any additive lab effect cancels out when looking at
the within-subject difference between tumour and control).
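
To make the cancellation argument concrete, here is a minimal Python
sketch. It assumes that "additive" means additive on the log scale (as in
a GLM with a log link). All numbers, lab names and function names are
made up for illustration and do not come from any real data set.

```python
# Illustration only: hypothetical log-scale expression values for one gene.
# Assumption: the lab effect is a constant shift on the log scale.

def with_lab_effect(log_value, lab_shift):
    """Return the observed log value after adding a lab-specific shift."""
    return log_value + lab_shift

def paired_log_fold_change(tumour_log, healthy_log):
    """Within-subject difference, tumour minus healthy."""
    return tumour_log - healthy_log

# "True" values for one subject (made up).
healthy_true = 5.0
tumour_true = 6.5

# The same subject as it would be measured in two hypothetical labs.
lab_shifts = {"lab_1": 0.0, "lab_2": 0.75}

paired_lfc = {}
observed_healthy = {}
for lab, shift in lab_shifts.items():
    healthy_obs = with_lab_effect(healthy_true, shift)
    tumour_obs = with_lab_effect(tumour_true, shift)
    observed_healthy[lab] = healthy_obs
    # Both samples share the shift, so it drops out of the difference.
    paired_lfc[lab] = paired_log_fold_change(tumour_obs, healthy_obs)

# Comparing healthy samples across labs does NOT cancel the shift.
unpaired_healthy_diff = observed_healthy["lab_2"] - observed_healthy["lab_1"]

print(paired_lfc)
print(unpaired_healthy_diff)
```

The paired log fold change is the same in both hypothetical labs, while
the un-paired healthy-versus-healthy difference equals the lab shift.
This only holds if both samples of a subject were processed in the same
lab.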
So, considering all this, I'd say, go ahead with your comparison but
make sure that all your tests are paired.
So, you would make a design table with one row for each sample and three
columns: subject (one level for each subject, IDs running over all
subjects from both studies), disease state (two levels: healthy control
or tumour tissue) and cancer type (two levels: cancer A or cancer B;
this is the cancer of the subject, not the sample, so the healthy tissue
samples get a cancer type assigned, too).
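
The design table described above could be sketched in Python as follows.
The subject IDs, the cancer labels and the helper make_design_table are
hypothetical, and the sketch assumes exactly one healthy and one tumour
sample per subject.

```python
# Illustration only: subject IDs and cancer labels below are hypothetical.

def make_design_table(subjects):
    """Build one row per sample from (subject_id, cancer_type) pairs.

    Every subject contributes two samples, healthy and tumour, and both
    rows carry the subject's cancer type.
    """
    rows = []
    for subject_id, cancer_type in subjects:
        for disease_state in ("healthy", "tumour"):
            rows.append({
                "sample": f"{subject_id}_{disease_state}",
                "subject": subject_id,
                "disease_state": disease_state,
                "cancer_type": cancer_type,
            })
    return rows

# Subject IDs run over both studies, so no ID is reused between them.
subjects = [("S1", "A"), ("S2", "A"), ("S3", "B"), ("S4", "B")]
design = make_design_table(subjects)

for row in design:
    print(row["sample"], row["subject"], row["disease_state"],
          row["cancer_type"], sep="\t")
```

Note that the healthy rows carry the cancer type of their subject, which
is what lets the model relate each control sample to the right cancer.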
Now, you fit a reduced model
count ~ subject + disease_state + cancer_type
and a full model
count ~ subject + disease_state + cancer_type + disease_state:cancer_type
and compare them to test for significance of the interaction term (which
indicates that the difference between tumour and control tissue differs
between cancers for the tested gene).
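
As a rough intuition for what the interaction term measures, the
following Python sketch computes a difference of paired log fold changes.
It is not what DESeq or edgeR do internally (they fit negative binomial
GLMs to counts). It assumes log-scale values and a balanced paired
design, and all numbers and names are hypothetical.

```python
# Illustration only: made-up log-scale values for one gene,
# given per subject as (healthy, tumour).
from statistics import mean

log_values = {
    "A": {"S1": (5.0, 6.0), "S2": (4.0, 5.5)},  # subjects with cancer A
    "B": {"S3": (5.5, 8.0), "S4": (4.5, 6.5)},  # subjects with cancer B
}

def mean_paired_lfc(pairs):
    """Average within-subject difference, tumour minus healthy."""
    return mean(tumour - healthy for healthy, tumour in pairs.values())

lfc_a = mean_paired_lfc(log_values["A"])
lfc_b = mean_paired_lfc(log_values["B"])

# Difference of differences: how much the tumour-vs-healthy change
# differs between the two cancers. A value near zero would mean the
# gene responds similarly in both cancer types.
interaction = lfc_b - lfc_a

print(lfc_a, lfc_b, interaction)
```

Testing the interaction term then amounts to asking whether such a
difference of differences is larger than the noise would explain.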
(The formula notation is with DESeq in mind. In DESeq2, you only fit the
second model and then do a Wald test for the interaction coefficient, as
described in the vignette. For edgeR, IIRC, you also just fit the full
model and then get a p value for the last coefficient, which should be
the interaction coefficient.)