[BioC] weird model design for DE analysis

Gordon K Smyth smyth at wehi.EDU.AU
Sun Jul 28 05:05:17 CEST 2013

Dear Lorena,

You are right to recognise that this design needs special treatment.

You can analyse all the data together, but you will need a statistical 
procedure that can account for the fact that the control samples in the 
two batches are biologically paired, whereas all the other samples are 
biologically independent.  This requires a method that can fit random 
effects for the individuals or can estimate within individual 

You cannot do this in edgeR, DESeq, DESeq2 or in any of the other RNA-seq 
packages based on the negative binomial distribution.  These packages 
assume that all samples are statistically independent.

The only valid approach that I know of would be to use voom and the 
duplicate correlation approach of the limma package.  First, read in your 
data (for both batches) and normalize your data as in the edgeR package. 
If y is your DGEList object, then

   y <- calcNormFactors(y)

Then setup you design matrix as you would do for limma or edgeR.  This 
includes effects for the batch and treatment, but not effects for 
patients. Something like

   design <- model.matrix(~batch+condition)

where batch takes two levels (A and B) and condition takes three levels 
(control, preclinical, clinical).

Then transform to logCPM using the voom function.

   v <- voom(y, design, plot=TRUE)

Then estimate the correlation between the repeat samples for the same 

   dupcor <- duplicateCorrelation(v, design, block=individual)

Here 'individual' is a vector or factor indicating which individual each 
sample comes from.  It has length equal to the number of samples and takes 
21 levels, one for each individual in your data (7 controls, 7 preclinic 
patients, 7 clinical patients).

Then conduct a differential expression analysis:

   fit <- lmFit(v, design, block=individual, correlation=dupcor$consensus)
   fit <- eBayes(fit)

Be sure to check that the value of dupcor$conensus is positive.  This 
analysis keeps track of the fact that the two samples from each control 
individual are correlated.

Then you can use the topTable() function in the limma package to extract 
genes that follow any pattern of differential expression that you are 
interested in.

Best wishes

Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.

-------------- original message ---------------
[BioC] weird model design for DE analysis
Lorena Pantano lorena.pantano at gmail.com
Fri Jul 26 14:53:36 CEST 2013

Hi, thanks for the quick reply!

The control group are 14 samples, 7 batch A, 7 batch B. But those 7 are 
the same individuals, same RNA. So sample c1 in batch A, is coming from 
same individual than sample c1 in batch B. Hope that is cleat.

And, yes, I also treated as separate level factor, but I want to check 
also this scenario, the increase or decrease of gene expression from 
control -> pre-case -> case.

thanks again!

On Fri, Jul 26, 2013 at 2:29 PM, Michael Love
<michaelisaiahlove at gmail.com>wrote:

>hi Lorena
>Can you clarify on a few points below:
>On Fri, Jul 26, 2013 at 1:29 PM, Lorena Pantano
><lorena.pantano at gmail.com> wrote:
>> Hi,
>> I have some doubts about my data. I would like to do DE analysis of 
>> RNAseq data.
>> I have two set of experiments done in two different time points, so 
>> there is probably batch effect: let's say batch A and B.
>> For A, I have 7 controls and 7 cases (clinical identified) For B, I 
>> have the same 7 controls and other 7 pre-cases (preclinical identified)
>> I wanted to know if I can analyze all together, but I have questions 
>> like:
>> 1-7 cases from A are only in batch A, and the others 7 are only in 
>> batch B, is it correct to setup a blocked variable indicating the batch 
>> effect although I only have one entire group for one batch, and another 
>> entire group for another batch?
> You can still use blocking. Because the control group spans batch A and 
> batch B, you can still estimate the batch effect (the model matrix 
> should still be full rank).
>> 2-and 7 control for A and B are same people, so it is not quite right 
>> to merge all together because it would be more like technical
>> replicates.
> I don't understand this last sentence. Do you mean that some, but not 
> all, of individuals in the control group from batch A are also in batch 
> B?
> I wouldn't say they are technical replicates if the samples are drawn 
> from the same individual. RNA-Seq experiments of the same individual 
> will contain biological variability.
>> My goal is:
>> 1-get DE genes that follow additive effect. Something like increase a
>> little in pre-cases, and increase more in cases.
> It sounds to me like you might want to treat pre-cases and cases as 
> separate levels of a factor variable, rather than assume that the trend 
> will necessarily be: control < pre-case < case. But to receive more 
> specific advice, you should probably explain more about the share of 
> individuals across the 28 samples.
> Mike

The information in this email is confidential and intend...{{dropped:4}}

More information about the Bioconductor mailing list