[BioC] Defining the model matrix for a multi-factorial design with interactions - edgeR
Marco [guest]
guest at bioconductor.org
Wed Jan 29 16:11:03 CET 2014
Hi all!
I have a question about the way to define the model matrix for a multi-factorial design in which I want to model all the interactions. In particular, I have a 3x3x2 factorial design (3 factors) and three biological replicates, so I have 3x3x2x3 = 54 sample.
The three factors are:
- GENOTYPE with the three levels F, G and P
- LOAD with levels HIGH, MEDIUM and LOW
- TIME with levels H and PH
So this is the design matrix:
> design
Time Load Genotype
FHH_1 H High F
FHPH_1 PH High F
FHH_2 H High F
FHPH_2 PH High F
FHH_3 H High F
FHPH_3 PH High F
FMH_1 H Medium F
FMPH_1 PH Medium F
FMH_2 H Medium F
FMPH_2 PH Medium F
FMH_3 H Medium F
FMPH_3 PH Medium F
FLH_1 H Low F
FLPH_1 PH Low F
FLH_2 H Low F
FLPH_2 PH Low F
FLH_3 H Low F
FLPH_3 PH Low F
GHH_1 H High G
GHPH_1 PH High G
GHH_2 H High G
GHPH_2 PH High G
GHH_3 H High G
GHPH_3 PH High G
GMH_1 H Medium G
GMPH_1 PH Medium G
GMH_2 H Medium G
GMPH_2 PH Medium G
GMH_3 H Medium G
GMPH_3 PH Medium G
GLH_1 H Low G
GLPH_1 PH Low G
GLH_2 H Low G
GLPH_2 PH Low G
GLH_3 H Low G
GLPH_3 PH Low G
PHH_1 H High P
PHPH_1 PH High P
PHH_2 H High P
PHPH_2 PH High P
PHH_3 H High P
PHPH_3 PH High P
PMH_1 H Medium P
PMPH_1 PH Medium P
PMH_2 H Medium P
PMPH_2 PH Medium P
PMH_3 H Medium P
PMPH_3 PH Medium P
PLH_1 H Low P
PLPH_1 PH Low P
PLH_2 H Low P
PLPH_2 PH Low P
PLH_3 H Low P
PLPH_3 PH Low P
I am interested in finding all possible interactions between the three factors, so I want to model also the three way interactions GENOTYPE:LOAD:TIME. I have already read the edgeR vignette so I was wondering if the best way to define the model matrix is by defining it with a formula or by defining each treatment combination as a group.
I'm trying to understand which are the differences between the two alternatives, so I tried to estimate both the models:
Model 1)
mydesign1<-model.matrix(~Genotype * Load * Time, data=design)
D1 <- estimateGLMCommonDisp(D, mydesign1) # D is the DGEList of the normalized counts
D1 <- estimateGLMTrendedDisp(D1, mydesign1)
fit1 <- glmFit(D1, mydesign1)
The estimated coefficients are:
> colnames(fit1)
[1] "(Intercept)" "GenotypeG"
[3] "GenotypeP" "LoadLow"
[5] "LoadMedium" "TimePH"
[7] "GenotypeG:LoadLow" "GenotypeP:LoadLow"
[9] "GenotypeG:LoadMedium" "GenotypeP:LoadMedium"
[11] "GenotypeG:TimePH" "GenotypeP:TimePH"
[13] "LoadLow:TimePH" "LoadMedium:TimePH"
[15] "GenotypeG:LoadLow:TimePH" "GenotypeP:LoadLow:TimePH"
[17] "GenotypeG:LoadMedium:TimePH" "GenotypeP:LoadMedium:TimePH"
Model 2)
Group <- factor(paste(design$Genotype, design$Load, design$Time, sep="."))
# I have 18 unique levels in Group (54 samples / 3 biological replicates)
mydesign2<-model.matrix(~0+Group)
colnames(mydesign2) <- levels(Group)
D2 <- estimateGLMCommonDisp(D, mydesign2)
D2 <- estimateGLMTrendedDisp(D2, mydesign2)
fit2<-glmFit(D2, mydesign2)
> colnames(fit2)
[1] "H.High.F" "H.High.G" "H.High.P" "H.Low.F" "H.Low.G"
[6] "H.Low.P" "H.Medium.F" "H.Medium.G" "H.Medium.P" "PH.High.F"
[11] "PH.High.G" "PH.High.P" "PH.Low.F" "PH.Low.G" "PH.Low.P"
[16] "PH.Medium.F" "PH.Medium.G" "PH.Medium.P"
Now, to understand the differences between the two models, my question was this:
if I define the contrast:
contrast1<-c(rep(0,14), 1, 0, -1, 0) # "GenotypeG:LoadLow:TimePH" vs "GenotypeG:LoadMedium:TimePH"
on the first model and I make a LR test, will I obtain the same result as making a LR test on the contrast:
contrast2<-makeContrasts(G.LowvsMedium.PH = PH.Low.G - PH.Medium.G, levels=mydesign2)
on the second model?
This is the code I used to make the tests:
lrt1.G.LowvsMedium.PH<-glmLRT(fit1, contrast=contrast1)
summary(decideTestsDGE(lrt1.G.LowvsMedium.PH))
and
lrt2.G.LowvsMedium.PH<-glmLRT(fit2, contrast=contrast2[," G.LowvsMedium.PH"])
summary(decideTestsDGE(lrt2.G.LowvsMedium.PH))
The answer to the question is no, because I've tried to do it on my data and the results are different between the two tests. But why? Am I doing two completely different tests? The two parameters of the two models aren't saying me the same thing?
In a moment of madness :) I tried to estimate another model, with only the three-level interactions between the factors.
Model 3)
mydesign3<-model.matrix(~0+Genotype:Load:Time, data=design)
D3 <- estimateGLMCommonDisp(D, mydesign3)
D3 <- estimateGLMTrendedDisp(D3, mydesign3)
fit3<-glmFit(D3, mydesign3)
> colnames(fit3)
[1] "GenotypeF:LoadHigh:TimeH" "GenotypeG:LoadHigh:TimeH"
[3] "GenotypeP:LoadHigh:TimeH" "GenotypeF:LoadLow:TimeH"
[5] "GenotypeG:LoadLow:TimeH" "GenotypeP:LoadLow:TimeH"
[7] "GenotypeF:LoadMedium:TimeH" "GenotypeG:LoadMedium:TimeH"
[9] "GenotypeP:LoadMedium:TimeH" "GenotypeF:LoadHigh:TimePH"
[11] "GenotypeG:LoadHigh:TimePH" "GenotypeP:LoadHigh:TimePH"
[13] "GenotypeF:LoadLow:TimePH" "GenotypeG:LoadLow:TimePH"
[15] "GenotypeP:LoadLow:TimePH" "GenotypeF:LoadMedium:TimePH"
[17] "GenotypeG:LoadMedium:TimePH" "GenotypeP:LoadMedium:TimePH"
Well, this model gave me the same parameter estimates as the model 2 (each treatment combination as a group) and, obviously, the same results in terms of DEG. But is it statistically correct to estimate a model with three-level interactions without any principal effect?
Thanks in advance for any replies!
Marco
-- output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C
[5] LC_TIME=Italian_Italy.1252
attached base packages:
[1] splines parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] EDASeq_1.8.0 ShortRead_1.20.0 Rsamtools_1.14.2
[4] lattice_0.20-24 Biostrings_2.30.1 GenomicRanges_1.14.4
[7] XVector_0.2.0 IRanges_1.20.6 Biobase_2.22.0
[10] BiocGenerics_0.8.0 edgeR_3.4.2 limma_3.18.9
[13] aroma.light_1.32.0 matrixStats_0.8.14 yeastRNASeq_0.0.10
loaded via a namespace (and not attached):
[1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6
[4] DBI_0.2-7 DESeq_1.14.0 genefilter_1.44.0
[7] geneplotter_1.40.0 grid_3.0.2 hwriter_1.3
[10] latticeExtra_0.6-26 R.methodsS3_1.6.1 R.oo_1.17.0
[13] RColorBrewer_1.0-5 RSQLite_0.11.4 stats4_3.0.2
[16] survival_2.37-6 tools_3.0.2 XML_3.98-1.1
[19] xtable_1.7-1 zlibbioc_1.8.0
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list