[BioC] Defining the model matrix for a multi-factorial design with interactions - edgeR

Wed Jan 29 19:12:56 CET 2014

Hi Marco,

Note that your models 2 & 3 are exactly the same model. The expression 
in your formula for model 3:

Genotype:Load:Time

is more or less equivalent to:

factor(paste(design$Genotype, design$Load, design$Time, sep=":"))

except that the factor levels might possibly be in a different order. As 
for your question about doing the same test with Models 1 & 2, I believe 
that both models are equivalent parametrizations of the same design, so 
you should get (nearly) identical dispersion estimates, and performing 
equivalent tests with the two models should give you the same p-values. 
If you are not getting the same p-values, then you have not properly set 
up the contrast for Model 1. This isn't surprising, because a 3-way 
interaction model is really confusing to work with, and I can't even 
tell you what the correct contrast would be. I would recommend that you 
stick with Model 2 (or 3), where each coefficient has a direct 
biological interpretation as the estimated expression level of a group. 
Designing contrasts for this parametrization is much easier, in my 
opinion. I think you would agree, since you got this contrast correct in 
your example code.

-Ryan

On 1/29/14, 7:11 AM, Marco [guest] wrote:
> Hi all!
>
> I have a question about the way to define the model matrix for a multi-factorial design in which I want to model all the interactions. In particular, I have a 3x3x2 factorial design (3 factors) and three biological replicates, so I have 3x3x2x3 = 54 sample.
> The three factors are:
> - GENOTYPE with the three levels F, G and P
> - LOAD with levels HIGH, MEDIUM and LOW
> - TIME with levels H and PH
>
> So this is the design matrix:
>
>> design
>         Time   Load Genotype
> FHH_1     H   High        F
> FHPH_1   PH   High        F
> FHH_2     H   High        F
> FHPH_2   PH   High        F
> FHH_3     H   High        F
> FHPH_3   PH   High        F
> FMH_1     H Medium        F
> FMPH_1   PH Medium        F
> FMH_2     H Medium        F
> FMPH_2   PH Medium        F
> FMH_3     H Medium        F
> FMPH_3   PH Medium        F
> FLH_1     H    Low        F
> FLPH_1   PH    Low        F
> FLH_2     H    Low        F
> FLPH_2   PH    Low        F
> FLH_3     H    Low        F
> FLPH_3   PH    Low        F
> GHH_1     H   High        G
> GHPH_1   PH   High        G
> GHH_2     H   High        G
> GHPH_2   PH   High        G
> GHH_3     H   High        G
> GHPH_3   PH   High        G
> GMH_1     H Medium        G
> GMPH_1   PH Medium        G
> GMH_2     H Medium        G
> GMPH_2   PH Medium        G
> GMH_3     H Medium        G
> GMPH_3   PH Medium        G
> GLH_1     H    Low        G
> GLPH_1   PH    Low        G
> GLH_2     H    Low        G
> GLPH_2   PH    Low        G
> GLH_3     H    Low        G
> GLPH_3   PH    Low        G
> PHH_1     H   High        P
> PHPH_1   PH   High        P
> PHH_2     H   High        P
> PHPH_2   PH   High        P
> PHH_3     H   High        P
> PHPH_3   PH   High        P
> PMH_1     H Medium        P
> PMPH_1   PH Medium        P
> PMH_2     H Medium        P
> PMPH_2   PH Medium        P
> PMH_3     H Medium        P
> PMPH_3   PH Medium        P
> PLH_1     H    Low        P
> PLPH_1   PH    Low        P
> PLH_2     H    Low        P
> PLPH_2   PH    Low        P
> PLH_3     H    Low        P
> PLPH_3   PH    Low        P
>
>
> I am interested in finding all possible interactions between the three factors, so I want to model also the three way interactions GENOTYPE:LOAD:TIME. I have already read the edgeR vignette so I was wondering if the best way to define the model matrix is by defining it with a formula or by defining each treatment combination as a group.
> I'm trying to understand which are the differences between the two alternatives, so I tried to estimate both the models:
>
> Model 1)
>    mydesign1<-model.matrix(~Genotype * Load * Time, data=design)
>    D1 <- estimateGLMCommonDisp(D, mydesign1) # D is the DGEList of the normalized counts
> 	D1 <- estimateGLMTrendedDisp(D1, mydesign1)
> 	fit1 <- glmFit(D1, mydesign1)
>
> The estimated coefficients are:
> 	> colnames(fit1)
> 	 [1] "(Intercept)"                 "GenotypeG"
> 	 [3] "GenotypeP"                   "LoadLow"
> 	 [5] "LoadMedium"                  "TimePH"
> 	 [7] "GenotypeG:LoadLow"           "GenotypeP:LoadLow"
> 	 [9] "GenotypeG:LoadMedium"        "GenotypeP:LoadMedium"
> 	[11] "GenotypeG:TimePH"            "GenotypeP:TimePH"
> 	[13] "LoadLow:TimePH"              "LoadMedium:TimePH"
> 	[15] "GenotypeG:LoadLow:TimePH"    "GenotypeP:LoadLow:TimePH"
> 	[17] "GenotypeG:LoadMedium:TimePH" "GenotypeP:LoadMedium:TimePH"
>    
>    
> Model 2)
>    Group <- factor(paste(design$Genotype, design$Load, design$Time, sep="."))
>    # I have 18 unique levels in Group (54 samples / 3 biological replicates)
>
>    mydesign2<-model.matrix(~0+Group)
>    colnames(mydesign2) <- levels(Group)
> 	
> 	D2 <- estimateGLMCommonDisp(D, mydesign2)
> 	D2 <- estimateGLMTrendedDisp(D2, mydesign2)
> 	
> 	fit2<-glmFit(D2, mydesign2)
> 	> colnames(fit2)
> 	 [1] "H.High.F"    "H.High.G"    "H.High.P"    "H.Low.F"     "H.Low.G"
> 	 [6] "H.Low.P"     "H.Medium.F"  "H.Medium.G"  "H.Medium.P"  "PH.High.F"
> 	[11] "PH.High.G"   "PH.High.P"   "PH.Low.F"    "PH.Low.G"    "PH.Low.P"
> 	[16] "PH.Medium.F" "PH.Medium.G" "PH.Medium.P"
>    
> Now, to understand the differences between the two models, my question was this:
> if I define the contrast:
>
>    contrast1<-c(rep(0,14), 1, 0, -1, 0) # "GenotypeG:LoadLow:TimePH" vs "GenotypeG:LoadMedium:TimePH"
>
> on the first model and I make a LR test, will I obtain the same result as making a LR test on the contrast:
>
>    contrast2<-makeContrasts(G.LowvsMedium.PH = PH.Low.G - PH.Medium.G, levels=mydesign2)
>
> on the second model?
> This is the code I used to make the tests:
>
>    lrt1.G.LowvsMedium.PH<-glmLRT(fit1, contrast=contrast1)
>    summary(decideTestsDGE(lrt1.G.LowvsMedium.PH))
>    
> and
>
>    lrt2.G.LowvsMedium.PH<-glmLRT(fit2, contrast=contrast2[," G.LowvsMedium.PH"])
>    summary(decideTestsDGE(lrt2.G.LowvsMedium.PH))
>
> The answer to the question is no, because I've tried to do it on my data and the results are different between the two tests. But why? Am I doing two completely different tests? The two parameters of the two models aren't saying me the same thing?
>
> In a moment of madness :) I tried to estimate another model, with only the three-level interactions between the factors.
>
> Model 3)
>    mydesign3<-model.matrix(~0+Genotype:Load:Time, data=design)
>
>    D3 <- estimateGLMCommonDisp(D, mydesign3)
>    D3 <- estimateGLMTrendedDisp(D3, mydesign3)
>
>    fit3<-glmFit(D3, mydesign3)
>    > colnames(fit3)
>    [1] "GenotypeF:LoadHigh:TimeH"    "GenotypeG:LoadHigh:TimeH"
>    [3] "GenotypeP:LoadHigh:TimeH"    "GenotypeF:LoadLow:TimeH"
>    [5] "GenotypeG:LoadLow:TimeH"     "GenotypeP:LoadLow:TimeH"
>    [7] "GenotypeF:LoadMedium:TimeH"  "GenotypeG:LoadMedium:TimeH"
>    [9] "GenotypeP:LoadMedium:TimeH"  "GenotypeF:LoadHigh:TimePH"
>    [11] "GenotypeG:LoadHigh:TimePH"   "GenotypeP:LoadHigh:TimePH"
>    [13] "GenotypeF:LoadLow:TimePH"    "GenotypeG:LoadLow:TimePH"
>    [15] "GenotypeP:LoadLow:TimePH"    "GenotypeF:LoadMedium:TimePH"
>    [17] "GenotypeG:LoadMedium:TimePH" "GenotypeP:LoadMedium:TimePH"
>    
> Well, this model gave me the same parameter estimates as the model 2 (each treatment combination as a group) and, obviously, the same results in terms of DEG. But is it statistically correct to estimate a model with three-level interactions without any principal effect?
>
> Thanks in advance for any replies!
>
> Marco
>
>   -- output of sessionInfo():
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252
> [3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
> [5] LC_TIME=Italian_Italy.1252
>
> attached base packages:
> [1] splines   parallel  stats     graphics  grDevices utils     datasets
> [8] methods   base
>
> other attached packages:
>   [1] EDASeq_1.8.0         ShortRead_1.20.0     Rsamtools_1.14.2
>   [4] lattice_0.20-24      Biostrings_2.30.1    GenomicRanges_1.14.4
>   [7] XVector_0.2.0        IRanges_1.20.6       Biobase_2.22.0
> [10] BiocGenerics_0.8.0   edgeR_3.4.2          limma_3.18.9
> [13] aroma.light_1.32.0   matrixStats_0.8.14   yeastRNASeq_0.0.10
>
> loaded via a namespace (and not attached):
>   [1] annotate_1.40.0      AnnotationDbi_1.24.0 bitops_1.0-6
>   [4] DBI_0.2-7            DESeq_1.14.0         genefilter_1.44.0
>   [7] geneplotter_1.40.0   grid_3.0.2           hwriter_1.3
> [10] latticeExtra_0.6-26  R.methodsS3_1.6.1    R.oo_1.17.0
> [13] RColorBrewer_1.0-5   RSQLite_0.11.4       stats4_3.0.2
> [16] survival_2.37-6      tools_3.0.2          XML_3.98-1.1
> [19] xtable_1.7-1         zlibbioc_1.8.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor