[BioC] Defining the model matrix for a multi-factorial design with interactions - edgeR
rct at thompsonclan.org
Wed Jan 29 19:12:56 CET 2014
Note that your models 2 & 3 are exactly the same model. The expression
in your formula for model 3:
is more or less equivalent to:
factor(paste(design$Genotype, design$Load, design$Time, sep=":"))
except that the factor levels might possibly be in a different order. As
for your question about doing the same test with Models 1 & 2, I believe
that both models are equivalent parametrizations of the same design, so
you should get (nearly) identical dispersion estimates, and performing
equivalent tests with the two models should give you the same p-values.
If you are not getting the same p-values, then you have not properly set
up the contrast for Model 1. This isn't surprising, because a 3-way
interaction model is really confusing to work with, and I can't even
tell you what the correct contrast would be. I would recommend that you
stick with Model 2 (or 3), where each coefficient has a direct
biological interpretation as the estimated expression level of a group.
Designing contrasts for this parametrization is much easier, in my
opinion. I think you would agree, since you got this contrast correct in
your example code.
On 1/29/14, 7:11 AM, Marco [guest] wrote:
> Hi all!
> I have a question about the way to define the model matrix for a multi-factorial design in which I want to model all the interactions. In particular, I have a 3x3x2 factorial design (3 factors) and three biological replicates, so I have 3x3x2x3 = 54 sample.
> The three factors are:
> - GENOTYPE with the three levels F, G and P
> - LOAD with levels HIGH, MEDIUM and LOW
> - TIME with levels H and PH
> So this is the design matrix:
> Time Load Genotype
> FHH_1 H High F
> FHPH_1 PH High F
> FHH_2 H High F
> FHPH_2 PH High F
> FHH_3 H High F
> FHPH_3 PH High F
> FMH_1 H Medium F
> FMPH_1 PH Medium F
> FMH_2 H Medium F
> FMPH_2 PH Medium F
> FMH_3 H Medium F
> FMPH_3 PH Medium F
> FLH_1 H Low F
> FLPH_1 PH Low F
> FLH_2 H Low F
> FLPH_2 PH Low F
> FLH_3 H Low F
> FLPH_3 PH Low F
> GHH_1 H High G
> GHPH_1 PH High G
> GHH_2 H High G
> GHPH_2 PH High G
> GHH_3 H High G
> GHPH_3 PH High G
> GMH_1 H Medium G
> GMPH_1 PH Medium G
> GMH_2 H Medium G
> GMPH_2 PH Medium G
> GMH_3 H Medium G
> GMPH_3 PH Medium G
> GLH_1 H Low G
> GLPH_1 PH Low G
> GLH_2 H Low G
> GLPH_2 PH Low G
> GLH_3 H Low G
> GLPH_3 PH Low G
> PHH_1 H High P
> PHPH_1 PH High P
> PHH_2 H High P
> PHPH_2 PH High P
> PHH_3 H High P
> PHPH_3 PH High P
> PMH_1 H Medium P
> PMPH_1 PH Medium P
> PMH_2 H Medium P
> PMPH_2 PH Medium P
> PMH_3 H Medium P
> PMPH_3 PH Medium P
> PLH_1 H Low P
> PLPH_1 PH Low P
> PLH_2 H Low P
> PLPH_2 PH Low P
> PLH_3 H Low P
> PLPH_3 PH Low P
> I am interested in finding all possible interactions between the three factors, so I want to model also the three way interactions GENOTYPE:LOAD:TIME. I have already read the edgeR vignette so I was wondering if the best way to define the model matrix is by defining it with a formula or by defining each treatment combination as a group.
> I'm trying to understand which are the differences between the two alternatives, so I tried to estimate both the models:
> Model 1)
> mydesign1<-model.matrix(~Genotype * Load * Time, data=design)
> D1 <- estimateGLMCommonDisp(D, mydesign1) # D is the DGEList of the normalized counts
> D1 <- estimateGLMTrendedDisp(D1, mydesign1)
> fit1 <- glmFit(D1, mydesign1)
> The estimated coefficients are:
> > colnames(fit1)
>  "(Intercept)" "GenotypeG"
>  "GenotypeP" "LoadLow"
>  "LoadMedium" "TimePH"
>  "GenotypeG:LoadLow" "GenotypeP:LoadLow"
>  "GenotypeG:LoadMedium" "GenotypeP:LoadMedium"
>  "GenotypeG:TimePH" "GenotypeP:TimePH"
>  "LoadLow:TimePH" "LoadMedium:TimePH"
>  "GenotypeG:LoadLow:TimePH" "GenotypeP:LoadLow:TimePH"
>  "GenotypeG:LoadMedium:TimePH" "GenotypeP:LoadMedium:TimePH"
> Model 2)
> Group <- factor(paste(design$Genotype, design$Load, design$Time, sep="."))
> # I have 18 unique levels in Group (54 samples / 3 biological replicates)
> colnames(mydesign2) <- levels(Group)
> D2 <- estimateGLMCommonDisp(D, mydesign2)
> D2 <- estimateGLMTrendedDisp(D2, mydesign2)
> fit2<-glmFit(D2, mydesign2)
> > colnames(fit2)
>  "H.High.F" "H.High.G" "H.High.P" "H.Low.F" "H.Low.G"
>  "H.Low.P" "H.Medium.F" "H.Medium.G" "H.Medium.P" "PH.High.F"
>  "PH.High.G" "PH.High.P" "PH.Low.F" "PH.Low.G" "PH.Low.P"
>  "PH.Medium.F" "PH.Medium.G" "PH.Medium.P"
> Now, to understand the differences between the two models, my question was this:
> if I define the contrast:
> contrast1<-c(rep(0,14), 1, 0, -1, 0) # "GenotypeG:LoadLow:TimePH" vs "GenotypeG:LoadMedium:TimePH"
> on the first model and I make a LR test, will I obtain the same result as making a LR test on the contrast:
> contrast2<-makeContrasts(G.LowvsMedium.PH = PH.Low.G - PH.Medium.G, levels=mydesign2)
> on the second model?
> This is the code I used to make the tests:
> lrt1.G.LowvsMedium.PH<-glmLRT(fit1, contrast=contrast1)
> lrt2.G.LowvsMedium.PH<-glmLRT(fit2, contrast=contrast2[," G.LowvsMedium.PH"])
> The answer to the question is no, because I've tried to do it on my data and the results are different between the two tests. But why? Am I doing two completely different tests? The two parameters of the two models aren't saying me the same thing?
> In a moment of madness :) I tried to estimate another model, with only the three-level interactions between the factors.
> Model 3)
> mydesign3<-model.matrix(~0+Genotype:Load:Time, data=design)
> D3 <- estimateGLMCommonDisp(D, mydesign3)
> D3 <- estimateGLMTrendedDisp(D3, mydesign3)
> fit3<-glmFit(D3, mydesign3)
> > colnames(fit3)
>  "GenotypeF:LoadHigh:TimeH" "GenotypeG:LoadHigh:TimeH"
>  "GenotypeP:LoadHigh:TimeH" "GenotypeF:LoadLow:TimeH"
>  "GenotypeG:LoadLow:TimeH" "GenotypeP:LoadLow:TimeH"
>  "GenotypeF:LoadMedium:TimeH" "GenotypeG:LoadMedium:TimeH"
>  "GenotypeP:LoadMedium:TimeH" "GenotypeF:LoadHigh:TimePH"
>  "GenotypeG:LoadHigh:TimePH" "GenotypeP:LoadHigh:TimePH"
>  "GenotypeF:LoadLow:TimePH" "GenotypeG:LoadLow:TimePH"
>  "GenotypeP:LoadLow:TimePH" "GenotypeF:LoadMedium:TimePH"
>  "GenotypeG:LoadMedium:TimePH" "GenotypeP:LoadMedium:TimePH"
> Well, this model gave me the same parameter estimates as the model 2 (each treatment combination as a group) and, obviously, the same results in terms of DEG. But is it statistically correct to estimate a model with three-level interactions without any principal effect?
> Thanks in advance for any replies!
> -- output of sessionInfo():
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>  LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252
>  LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
>  LC_TIME=Italian_Italy.1252
> attached base packages:
>  splines parallel stats graphics grDevices utils datasets
>  methods base
> other attached packages:
>  EDASeq_1.8.0 ShortRead_1.20.0 Rsamtools_1.14.2
>  lattice_0.20-24 Biostrings_2.30.1 GenomicRanges_1.14.4
>  XVector_0.2.0 IRanges_1.20.6 Biobase_2.22.0
>  BiocGenerics_0.8.0 edgeR_3.4.2 limma_3.18.9
>  aroma.light_1.32.0 matrixStats_0.8.14 yeastRNASeq_0.0.10
> loaded via a namespace (and not attached):
>  annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6
>  DBI_0.2-7 DESeq_1.14.0 genefilter_1.44.0
>  geneplotter_1.40.0 grid_3.0.2 hwriter_1.3
>  latticeExtra_0.6-26 R.methodsS3_1.6.1 R.oo_1.17.0
>  RColorBrewer_1.0-5 RSQLite_0.11.4 stats4_3.0.2
>  survival_2.37-6 tools_3.0.2 XML_3.98-1.1
>  xtable_1.7-1 zlibbioc_1.8.0
> Sent via the guest posting facility at bioconductor.org.
> Bioconductor mailing list
> Bioconductor at r-project.org
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor