[BioC] How to design matrix on edgeR to study genotype x environmental interaction
Gordon K Smyth
smyth at wehi.EDU.AU
Fri Nov 23 04:27:22 CET 2012
Dear Daniela,
I think you would be very well advised to seek out a statistical
bioinformatician with whom you can collaborate on an ongoing basis. A GxE
anova analysis would be statistically sophisticated even if you were
analysing a simple univariate phenotypic trait. Attempting to do that
sort of analysis in the context of an RNA-Seq experiment on miRNAs is far
more difficult again. The design matrices you have created may be
correct, but that's just the start of the analysis, and there are many
layers of possible complexity.
The BCV in your experiment is so large that I feel there must be quality
issues with your data that you have not successfully dealt with. It seems
very likely, for example, that there are batch effects that you have not
yet described.
To answer some specific questions:
You might be better off with prior.df=10 instead the default, but this has
little to do with the size of the BCV.
You ask why one variety and one stage are disappearing from your design
matrix. If you omit the "0+" in the first formula (and you should), you
will find that one vineyard will disappear as well. This is because the
number of contrasts for any factor must be one less than the number of
leveles. This is a very fundamental feature of factors and model formula
that you need to become familiar with before you can make sense of any
model formula.
Your email makes no mention of library sizes or sequencing depths, but
obviously that has a fundamental effect on what is significantly different
from what.
I think you know now how to use edgeR in principle. However, as you
probably already appreciate, deciding what is the right analysis for your
data is beyond the scope of the mailing list.
Best wishes
Gordon
On Thu, 22 Nov 2012, bioconductor-request at r-project.org wrote:
> Date: Thu, 22 Nov 2012 10:07:19 +0100
> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x
> environmental interaction
> Message-ID:
>
> Dear Gordon,
>
> Thank you so much for your valuable input. I took sometime to study a bit
> more and be able to consider all the aspects you pointed out. At this time
> I reconsider the analysis and started again, with the data exploration of
> all 48 samples.
>
> First I filtered out the low reads, considering just the ones with more
> than 1 cpm in at least 2 libraries (I have two replicates of each library);
> the MDS plot clearly separate one of the locations from the other two
> (dimension 1) and with less distinction the two varieties (dimension 2).
> The stages also seems to be separated in two groups (the first two ones
> together and separate of the two last ones) but as the varieties, not so
> distinct. The two replicates are also consistent.
>
> With the BCV plot I could observe that reads with lower logCPM have bigger
> BCV (the BCV value was equal to 0.5941), and then comes my first question:
>
> Should I choose *prior.df* different from the default, due to this
> behavior, when estimating genewise dispersion?
>
> To proceed with the DE analysis, I tried two approaches, this time with all
> the 48 samples, as suggested.
> For both approaches, I have the following data frame:
>
>> target
> Sample Vineyard Variety Stage
> 1 1 mont CS ps
> 2 2 mont CS ps
> 3 4 mont CS bc
> 4 5 mont CS bc
> 5 7 mont CS 19b
> 6 8 mont CS 19b
> 7 10 mont CS hv
> 8 11 mont CS hv
> 9 13 mont SG ps
> 10 14 mont SG ps
> 11 16 mont SG bc
> 12 17 mont SG bc
> 13 19 mont SG 19b
> 14 20 mont SG 19b
> 15 22 mont SG hv
> 16 23 mont SG hv
> 17 25 Bol CS ps
> 18 26 Bol CS ps
> 19 28 Bol CS bc
> 20 29 Bol CS bc
> 21 31 Bol CS 19b
> 22 32 Bol CS 19b
> 23 34 Bol CS hv
> 24 35 Bol CS hv
> 25 37 Bol SG ps
> 26 38 Bol SG ps
> 27 40 Bol SG bc
> 28 41 Bol SG bc
> 29 43 Bol SG 19b
> 30 44 Bol SG 19b
> 31 46 Bol SG hv
> 32 47 Bol SG hv
> 33 49 Ric CS ps
> 34 50 Ric CS ps
> 35 52 Ric CS bc
> 36 53 Ric CS bc
> 37 55 Ric CS 19b
> 38 56 Ric CS 19b
> 39 58 Ric CS hv
> 40 59 Ric CS hv
> 41 61 Ric SG ps
> 42 62 Ric SG ps
> 43 64 Ric SG bc
> 44 65 Ric SG bc
> 45 67 Ric SG 19b
> 46 68 Ric SG 19b
> 47 70 Ric SG hv
> 48 71 Ric SG hv
>
> At the first instance, I used the full interaction formula as the following
> code:
>
>> d <- DGEList(counts=file)
>> keep <- rowSums(cpm(DGElist) > 1) >= 2
>> DGElist <- DGElist[keep,]
>> DGElist$samples$lib.size <- colSums(DGElist$counts)
>> DGElist_norm <- calcNormFactors(DGElist)
> *> design <- model.matrix(~0 + Vineyard + Variety + Stage +
> Vineyard:Variety + Vineyard:Stage + Variety:Stage + Vineyard:Variety:Stage,
> data=target)*
>
> [or even (*> design <- model.matrix(~0 + Vineyard*Variety*Stage,
> data=target)*) which gives the same result]
>
>> rownames(design) <- colnames(DGEList_norm)
>
> However, when I call the *design* I see that one Variety (i.e., CS) and one
> Stage (i.e., 19b) are not present in the design matrix, as individual
> effect or even in the interactions.
>
> Then I passed to the second approach, in which, I create groups:
>
>> group <-
> factor(paste(target$Vineyard,target$Variety,target$Stage,sep="_"))
>> cbind(target,Group=group)
>> DGElist <- DGEList(counts=file,group=group)
>> keep <- rowSums(cpm(DGElist) > 1) >= 2
>> DGElist <- DGElist[keep,]
>> DGElist$samples$lib.size <- colSums(DGElist$counts)
>> DGElist_norm <- calcNormFactors(DGElist)
>> design <- model.matrix(~0+group, data=DGElist_norm$samples)
>> colnames(design) <- levels(group)
>
> The design matrix in this case include all the groups, and then I proceed
> doing:
>
>> commondisp <- estimateGLMCommonDisp(DGElist_norm, design, verbose=TRUE)
> Disp = 0.35294 , BCV = 0.5941
>> trenddisp <- estimateGLMTrendedDisp(commondisp, design)
>> tagwisedisp <- estimateGLMTagwiseDisp(trenddisp, design)
>> fit <- glmFit(tagwisedisp, design)
>> my.contrasts <- makeContrasts(CS_ps_BolvsMont = Bol_CS_ps-mont_CS_ps,
> CS_ps_BolvsRic = Bol_CS_ps-Ric_CS_ps, Bol_ps_CSvsSG = Bol_CS_ps-Bol_SG_ps,
> levels=design) #Just as some examples of the contrasts I am interested on.
>> lrt <- glmLRT(fit, contrast=my.contrasts[,"CS_ps_BolvsMont"])
>
> With this code, I got the results, but I am afraid that they are not very
> consistent with the data. To give one example, the DE results tell me that
> a given miRNA which has 0 and 1 reads respectively in the two replicates of
> one sample is significantly different when comparing with other sample in
> which this miRNA has 5 and 10 reads in the two replicates respectively,
> but in the same set of results another miRNA which has 4259 and 2198 reads
> respectively in the two replicates of one sample is not significantly
> different when comparing with the other sample in which this miRNA has
> 352 and 599 reads respectively in the two replicates. In other words, 0 and
> 1 are significantly different from 5 and 10 but 4259 and 2198 are
> not significantly different from 352 and 599. With this comparisons, I am
> just trying to interpret my data based on these results.
>
> I know that the test for differential expression is not made based on the
> raw reads, but I do not know exactly how it is made, anyway I expect that
> if I used the correct model to describe my data, the results will describe
> the differences consistently.
> Could you make any suggestions about my analysis? Creating the groups as I
> showed above, is it correct for testing all the interactions? Is there any
> explanation for the fact that the one variety and one stage "disappear"
> from the design matrix when using the full interaction formula?
>
> Sorry for the long email and thank you for all the advises,
>
> Best wishes
>
> Daniela Lopes Paim Pinto
> PhD student - Agrobiosciences
> Scuola Superiore Sant'Anna, Italy
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] edgeR_3.0.3 limma_3.14.1
>
> loaded via a namespace (and not attached):
> [1] tools_2.15.2
>
>
>
>
>
>
>
>
>
>
> 2012/11/11 Gordon K Smyth <smyth at wehi.edu.au>
>
>> Dear Daniela,
>>
>> What version of the edgeR are you using? The posting guide asks you to
>> give sessionInfo() output so we can see package versions.
>>
>> Your codes looks correct for testing an interaction, although you could
>> estimate the same interaction more directly using an interaction formula as
>> in Section 3.3.4 of the edgeR User's Guide.
>>
>> However the model you have used is correct only if all 12 samples
>> correspond to the same physiological stage. I wonder why you are not
>> analysing all the 48 samples together. I would start with data exploration
>> of all 48 samples, including exploration measures like transcript
>> filtering, library sizes, normalization factors, an MDS plot, a BCV plot,
>> and so on. The first step is to check the data quality before going on to
>> test for differential expression.
>>
>> edgeR has very high statistical power, even giving p-values smaller than I
>> would like in some cases. So if you're not getting any differential
>> expression, it is because there is none or because you have data quality
>> problems.
>>
>> Best wishes
>> Gordon
>>
>> Date: Fri, 9 Nov 2012 14:44:28 +0100
>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
>>> To: bioconductor at r-project.org
>>> Subject: Re: [BioC] How to design matrix on edgeR to study genotype x
>>> environmental interaction
>>>
>>> Dear Gordon,
>>>
>>> Thank you so much for the reference. I read all the chapter regarding to
>>> the models and I tried to set up the following code considering a data
>>> frame like this:
>>>
>>> target
>>>>
>>> Sample Variety Location
>>> 1 1 CS Mont
>>> 2 2 CS Mont
>>> 3 25 CS Bol
>>> 4 26 CS Bol
>>> 5 49 CS Ric
>>> 6 50 CS Ric
>>> 7 13 SG Mont
>>> 8 14 SG Mont
>>> 9 37 SG Bol
>>> 10 38 SG Bol
>>> 11 61 SG Ric
>>> 12 62 SG Ric
>>>
>>> group <- factor(paste(target$Variety,**target$Location,sep="_"))
>>>> cbind(target,Group=group)
>>>> d <- DGEList(counts=file,group=**group)
>>>> DGEnorm <- calcNormFactors(d)
>>>> design <- model.matrix(~0+group, data=DGEnorm$samples)
>>>> colnames(design) <- levels(group)
>>>>
>>>
>>> Which gave me the design matrix:
>>>
>>> design
>>>>
>>> CS_Bol CS_Mont CS_Ric SG_Bol SG_Mont SG_Ric
>>> CS_Mont 0 1 0 0 0 0
>>> CS_Mont.1 0 1 0 0 0 0
>>> CS_Bol 1 0 0 0 0 0
>>> CS_Bol.1 1 0 0 0 0 0
>>> CS_Ric 0 0 1 0 0 0
>>> CS_Ric.1 0 0 1 0 0 0
>>> SG_Mont 0 0 0 0 1 0
>>> SG_Mont.1 0 0 0 0 1 0
>>> SG_Bol 0 0 0 1 0 0
>>> SG_Bol.1 0 0 0 1 0 0
>>> SG_Ric 0 0 0 0 0 1
>>> SG_Ric.1 0 0 0 0 0 1
>>> attr(,"assign")
>>> [1] 1 1 1 1 1 1
>>> attr(,"contrasts")
>>> attr(,"contrasts")$group
>>> [1] "contr.treatment"
>>>
>>> And then I estimated the trended and tag wise dispersion and fit the model
>>> doing:
>>>
>>> disp.tren <- estimateGLMTrendedDisp(**DGEnorm,design)
>>>> disp.tag <- estimateGLMTagwiseDisp(disp.**tren,design)
>>>> fit <- glmFit(disp.tag,design)
>>>>
>>>
>>> When I made some contrasts to find DE miRNAs, for example:
>>>
>>> my.constrasts <- makeContrasts(CS_BolvsMont = CS_Bol-CS_Mont,
>>>>
>>> CSvsSG_BolvsMont = (CS_Bol-CS_Mont)-(SG_Bol-SG_**Mont), levels=design)
>>>
>>>> lrt <- glmLRT(fit, contrast=my.constrasts[,"CS_**BolvsMont"])
>>>>
>>>
>>> I expected to find DE miRNAs due the environment effect (CS_BolvsMont) and
>>> for example DE miRNAs due the interaction genotypeXenvironment (
>>> CSvsSG_BolvsMont).
>>>
>>> However the results do not seems to reflect it, since I did not get even a
>>> single DE miRNA with significant FDR (even less than 20%!!!!) and going
>>> back to the counts in the raw data I find reasonable differences in their
>>> expression, which was expected. I forgot to mention that I decided to
>>> consider stage by stage separately and not add one more factor on the
>>> model, since I am not interested, for the moment, on the time course (as I
>>> wrote in the previous email - see below).
>>>
>>> Could you (or any body else from the list) give me some advise regarding
>>> the code? Is this matrix appropriate for the kind of comparisons I am
>>> interested on?
>>>
>>> Thank you in advance for any input.
>>>
>>> Daniela
>>>
>>>
>>>
>>>
>>> 2012/10/30 Gordon K Smyth <smyth at wehi.edu.au>
>>>
>>> Dear Daniela,
>>>>
>>>> edgeR can work with any design matrix. Just setup your interaction
>>>> model using standard R model formula. See for example Chapter 11 of:
>>>>
>>>>
>>>> http://cran.r-project.org/doc/****manuals/R-intro.pdf<http://cran.r-project.org/doc/**manuals/R-intro.pdf>
>> <http://**cran.r-project.org/doc/**manuals/R-intro.pdf<http://cran.r-project.org/doc/manuals/R-intro.pdf>
>>>
>>
>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>> Date: Mon, 29 Oct 2012 16:24:31 +0100
>>>>
>>>>> From: Daniela Lopes Paim Pinto <d.lopespaimpinto at sssup.it>
>>>>> To: bioconductor at r-project.org
>>>>> Subject: [BioC] How to design matrix on edgeR to study genotype x
>>>>> environmental interaction
>>>>>
>>>>> Dear all,
>>>>>
>>>>> I'm currently working with data coming from deep sequencing of 48 small
>>>>> RNAs libraries and using edgeR to identify DE miRNAs. I could not figure
>>>>> out how to design my matrix for the following experimental design:
>>>>>
>>>>> I have 2 varieties (genotypes), cultivated in 3 different locations
>>>>> (environments) and collected in 4 physiological stages. None of them
>>>>> represent a control treatment. I'm particulary interested on identifying
>>>>> those miRNAs which modulate their expression dependent on genotypes (G),
>>>>> environments (E) and G x E interaction. For instance the same variety in
>>>>> the 3 different locations, both varieties in the same location and both
>>>>> varieties in the 3 different locations.
>>>>>
>>>>> I was wondering if I could use the section 3.3 of edgeR user guide as
>>>>> reference or if someone could suggest me any other alternative method.
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Daniela
>>>>>
>>>>>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list