[BioC] weird model design for DE analysis

Gordon K Smyth smyth at wehi.EDU.AU
Sun Jul 28 05:05:17 CEST 2013

```Dear Lorena,

You are right to recognise that this design needs special treatment.

You can analyse all the data together, but you will need a statistical
procedure that can account for the fact that the control samples in the
two batches are biologically paired, whereas all the other samples are
biologically independent.  This requires a method that can fit random
effects for the individuals or can estimate within individual
correlations.

You cannot do this in edgeR, DESeq, DESeq2 or in any of the other RNA-seq
packages based on the negative binomial distribution.  These packages
assume that all samples are statistically independent.

The only valid approach that I know of would be to use voom and the
data (for both batches) and normalize your data as in the edgeR package.
If y is your DGEList object, then

y <- calcNormFactors(y)

Then setup you design matrix as you would do for limma or edgeR.  This
includes effects for the batch and treatment, but not effects for
patients. Something like

design <- model.matrix(~batch+condition)

where batch takes two levels (A and B) and condition takes three levels
(control, preclinical, clinical).

Then transform to logCPM using the voom function.

v <- voom(y, design, plot=TRUE)

Then estimate the correlation between the repeat samples for the same
individual:

dupcor <- duplicateCorrelation(v, design, block=individual)

Here 'individual' is a vector or factor indicating which individual each
sample comes from.  It has length equal to the number of samples and takes
21 levels, one for each individual in your data (7 controls, 7 preclinic
patients, 7 clinical patients).

Then conduct a differential expression analysis:

fit <- lmFit(v, design, block=individual, correlation=dupcor\$consensus)
fit <- eBayes(fit)

Be sure to check that the value of dupcor\$conensus is positive.  This
analysis keeps track of the fact that the two samples from each control
individual are correlated.

Then you can use the topTable() function in the limma package to extract
genes that follow any pattern of differential expression that you are
interested in.

Best wishes
Gordon

Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

-------------- original message ---------------
[BioC] weird model design for DE analysis
Lorena Pantano lorena.pantano at gmail.com
Fri Jul 26 14:53:36 CEST 2013

Hi, thanks for the quick reply!

The control group are 14 samples, 7 batch A, 7 batch B. But those 7 are
the same individuals, same RNA. So sample c1 in batch A, is coming from
same individual than sample c1 in batch B. Hope that is cleat.

And, yes, I also treated as separate level factor, but I want to check
also this scenario, the increase or decrease of gene expression from
control -> pre-case -> case.

thanks again!
Lo

On Fri, Jul 26, 2013 at 2:29 PM, Michael Love
<michaelisaiahlove at gmail.com>wrote:

>hi Lorena
>
>Can you clarify on a few points below:
>
>On Fri, Jul 26, 2013 at 1:29 PM, Lorena Pantano
><lorena.pantano at gmail.com> wrote:
>> Hi,
>>
>> I have some doubts about my data. I would like to do DE analysis of
>> RNAseq data.
>>
>> I have two set of experiments done in two different time points, so
>> there is probably batch effect: let's say batch A and B.
>>
>> For A, I have 7 controls and 7 cases (clinical identified) For B, I
>> have the same 7 controls and other 7 pre-cases (preclinical identified)
>>
>> I wanted to know if I can analyze all together, but I have questions
>> like:
>>
>> 1-7 cases from A are only in batch A, and the others 7 are only in
>> batch B, is it correct to setup a blocked variable indicating the batch
>> effect although I only have one entire group for one batch, and another
>> entire group for another batch?
>
> You can still use blocking. Because the control group spans batch A and
> batch B, you can still estimate the batch effect (the model matrix
> should still be full rank).
>
>> 2-and 7 control for A and B are same people, so it is not quite right
>> to merge all together because it would be more like technical
>> replicates.
>
> I don't understand this last sentence. Do you mean that some, but not
> all, of individuals in the control group from batch A are also in batch
> B?
>
> I wouldn't say they are technical replicates if the samples are drawn
> from the same individual. RNA-Seq experiments of the same individual
> will contain biological variability.
>
>> My goal is:
>>
>> 1-get DE genes that follow additive effect. Something like increase a
>> little in pre-cases, and increase more in cases.
>
> It sounds to me like you might want to treat pre-cases and cases as
> separate levels of a factor variable, rather than assume that the trend
> will necessarily be: control < pre-case < case. But to receive more
> specific advice, you should probably explain more about the share of
> individuals across the 28 samples.
>
> Mike

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}

```