[BioC] edgeR for data combined from different studies and/or platforms

Wed Mar 20 10:16:30 CET 2013

Hi Santosh

Thanks for the additional information.

Combining data from different studies is always quite risky. When you 
see differences between your two cancer types, you will not be able to 
say for sure whether these are due to the fact that it's different 
cancer types or that it's different labs.

On the other hand, I agree with you that comparing the healthy samples 
produced by the two labs is a reasonable approach to show that the lab 
effects are small and well controlled. The only difficulty here is that 
such a test can only detect lab effects that are large compared to the 
differences between individuals within each data set.

In your actual comparison, on the other hand, you compare tumour and 
healthy tissue from the same individual, which is a setting with much 
more inferential power. So, a batch or lab effect might be large 
compared to the noise in the paired comparisons and yet too small to 
appear significant in the unpaired comparison of the healthy controls.

On the third hand, however, the usual method of correcting for lab or 
batch effects is to include a blocking factor in your linear model, and 
this assumes that the batch effect is additive. Once you accept this 
assumption (which may be questionable but is standard practice) you 
don't need to account for batch effects at all in a paired comparison 
between tumours and healthy tissue as long as both the tumour and the 
control sample from each subject have always been processed in the same 
lab (because then, any additive lab effect cancels out when looking at 
tumour-control differences).
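
To make the cancellation concrete, here is a minimal numeric sketch in
plain Python (not part of any Bioconductor workflow). All names and
numbers are invented for illustration, and it assumes the lab effect is
additive on the log scale, which is exactly the assumption discussed
above.

```python
# Hypothetical log-scale expression of one gene; all numbers are made up.
# Assumed model: log_expr = subject_baseline + tumour_effect (tumour only) + lab_effect


def log_expr(subject_baseline, tumour_effect, lab_effect, is_tumour):
    """Additive model on the log scale for a single sample."""
    return subject_baseline + (tumour_effect if is_tumour else 0.0) + lab_effect


def paired_difference(subject_baseline, tumour_effect, lab_effect):
    """Tumour minus control for one subject, both samples processed in the same lab."""
    tumour = log_expr(subject_baseline, tumour_effect, lab_effect, True)
    control = log_expr(subject_baseline, tumour_effect, lab_effect, False)
    return tumour - control


# Same subject baseline and tumour effect, but two labs with different offsets.
diff_lab1 = paired_difference(subject_baseline=5.0, tumour_effect=1.5, lab_effect=0.25)
diff_lab2 = paired_difference(subject_baseline=5.0, tumour_effect=1.5, lab_effect=2.0)

# Both differences equal the tumour effect; the lab offset has dropped out.
print(diff_lab1, diff_lab2)
```

If the tumour and control samples of a subject came from different labs,
the two lab offsets would no longer cancel, which is why the same-lab
condition matters.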

So, considering all this, I'd say, go ahead with your comparison but 
make sure that all your tests are paired.

So, you would make a design table with one row for each sample and three 
columns: subject (one level for each subject, IDs running over all 
subjects from both studies), disease state (two levels: healthy control 
or tumour tissue) and cancer type (two levels: cancer A or cancer B; 
this is the cancer of the subject, not the sample, so the healthy tissue 
samples get a cancer type assigned, too).
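
As a sketch of what this table looks like (plain Python rather than R;
the subject IDs and the number of subjects per study are hypothetical):

```python
# Hypothetical layout: two subjects per study; study 1 is cancer A, study 2 is cancer B.
subjects = {"S1": "A", "S2": "A", "S3": "B", "S4": "B"}  # subject ID -> cancer type

design_table = []
for subject, cancer_type in subjects.items():
    for disease_state in ("healthy", "tumour"):
        # The healthy sample inherits the cancer type of its subject.
        design_table.append(
            {
                "subject": subject,
                "disease_state": disease_state,
                "cancer_type": cancer_type,
            }
        )

# One row per sample, three columns.
for row in design_table:
    print(row["subject"], row["disease_state"], row["cancer_type"], sep="\t")
```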

Now, you fit a reduced model

count ~ subject + disease_state + cancer_type

and a full model

count ~ subject + disease_state + cancer_type +
    disease_state:cancer_type

and compare them to test for significance of the interaction term (which 
indicates that the difference between tumour and control tissue differs 
between cancers for the tested gene).

(The formula notation is with DESeq in mind. In DESeq2, you only fit the 
second model and then do a Wald test for the interaction coefficient, as 
described in the vignette. For edgeR, IIRC, you also just fit the full 
model and then get a p value for the last coefficient, which should be 
the interaction coefficient.)
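
One caveat worth checking before fitting: cancer type is constant within
each subject, so the cancer_type main-effect column is a sum of subject
indicator columns. A design matrix containing both is then not of full
rank, and the fitting functions may refuse it or report that coefficient
as not estimable. The sketch below (plain Python, exact arithmetic, a
hypothetical toy layout with four subjects and treatment-style dummy
coding that I am assuming here) illustrates this, and shows that leaving
out the cancer_type main-effect column still leaves the interaction
column in an estimable position as the last coefficient.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical toy layout: subject ID -> cancer type.
subjects = {"S1": "A", "S2": "A", "S3": "B", "S4": "B"}
subject_ids = list(subjects)


def design_row(subject, is_tumour, include_cancer_main_effect):
    """One design-matrix row with dummy coding; S1, healthy and cancer A are references."""
    cancer_b = 1 if subjects[subject] == "B" else 0
    tumour = 1 if is_tumour else 0
    row = [1]  # intercept
    row += [1 if subject == s else 0 for s in subject_ids[1:]]  # subject dummies
    row.append(tumour)  # disease_state
    if include_cancer_main_effect:
        row.append(cancer_b)  # cancer_type main effect
    row.append(tumour * cancer_b)  # disease_state:cancer_type interaction
    return row


def build_design(include_cancer_main_effect):
    return [
        design_row(s, t, include_cancer_main_effect)
        for s in subject_ids
        for t in (False, True)
    ]


with_main_effect = build_design(True)
without_main_effect = build_design(False)

# Prints the number of columns and the rank of each design matrix.
print(len(with_main_effect[0]), matrix_rank(with_main_effect))
print(len(without_main_effect[0]), matrix_rank(without_main_effect))
```

In this toy layout the first matrix has one more column than its rank,
while the second has as many columns as its rank.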

   Simon