[BioC] Defining the model matrix for a multi-factorial design with interactions - edgeR

Wed Jan 29 16:11:03 CET 2014

Hi all!

I have a question about the way to define the model matrix for a multi-factorial design in which I want to model all the interactions. In particular, I have a 3x3x2 factorial design (3 factors) and three biological replicates, so I have 3x3x2x3 = 54 sample.
The three factors are:
- GENOTYPE with the three levels F, G and P
- LOAD with levels HIGH, MEDIUM and LOW
- TIME with levels H and PH

So this is the design matrix:

> design
       Time   Load Genotype
FHH_1     H   High        F
FHPH_1   PH   High        F
FHH_2     H   High        F
FHPH_2   PH   High        F
FHH_3     H   High        F
FHPH_3   PH   High        F
FMH_1     H Medium        F
FMPH_1   PH Medium        F
FMH_2     H Medium        F
FMPH_2   PH Medium        F
FMH_3     H Medium        F
FMPH_3   PH Medium        F
FLH_1     H    Low        F
FLPH_1   PH    Low        F
FLH_2     H    Low        F
FLPH_2   PH    Low        F
FLH_3     H    Low        F
FLPH_3   PH    Low        F
GHH_1     H   High        G
GHPH_1   PH   High        G
GHH_2     H   High        G
GHPH_2   PH   High        G
GHH_3     H   High        G
GHPH_3   PH   High        G
GMH_1     H Medium        G
GMPH_1   PH Medium        G
GMH_2     H Medium        G
GMPH_2   PH Medium        G
GMH_3     H Medium        G
GMPH_3   PH Medium        G
GLH_1     H    Low        G
GLPH_1   PH    Low        G
GLH_2     H    Low        G
GLPH_2   PH    Low        G
GLH_3     H    Low        G
GLPH_3   PH    Low        G
PHH_1     H   High        P
PHPH_1   PH   High        P
PHH_2     H   High        P
PHPH_2   PH   High        P
PHH_3     H   High        P
PHPH_3   PH   High        P
PMH_1     H Medium        P
PMPH_1   PH Medium        P
PMH_2     H Medium        P
PMPH_2   PH Medium        P
PMH_3     H Medium        P
PMPH_3   PH Medium        P
PLH_1     H    Low        P
PLPH_1   PH    Low        P
PLH_2     H    Low        P
PLPH_2   PH    Low        P
PLH_3     H    Low        P
PLPH_3   PH    Low        P

I am interested in finding all possible interactions between the three factors, so I want to model also the three way interactions GENOTYPE:LOAD:TIME. I have already read the edgeR vignette so I was wondering if the best way to define the model matrix is by defining it with a formula or by defining each treatment combination as a group.
I'm trying to understand which are the differences between the two alternatives, so I tried to estimate both the models:

Model 1)
  mydesign1<-model.matrix(~Genotype * Load * Time, data=design)
  D1 <- estimateGLMCommonDisp(D, mydesign1) # D is the DGEList of the normalized counts
	D1 <- estimateGLMTrendedDisp(D1, mydesign1)
	fit1 <- glmFit(D1, mydesign1)

The estimated coefficients are:
	> colnames(fit1)
	 [1] "(Intercept)"                 "GenotypeG"                  
	 [3] "GenotypeP"                   "LoadLow"                    
	 [5] "LoadMedium"                  "TimePH"                     
	 [7] "GenotypeG:LoadLow"           "GenotypeP:LoadLow"          
	 [9] "GenotypeG:LoadMedium"        "GenotypeP:LoadMedium"       
	[11] "GenotypeG:TimePH"            "GenotypeP:TimePH"           
	[13] "LoadLow:TimePH"              "LoadMedium:TimePH"          
	[15] "GenotypeG:LoadLow:TimePH"    "GenotypeP:LoadLow:TimePH"   
	[17] "GenotypeG:LoadMedium:TimePH" "GenotypeP:LoadMedium:TimePH"

Model 2)
  Group <- factor(paste(design$Genotype, design$Load, design$Time, sep="."))
  # I have 18 unique levels in Group (54 samples / 3 biological replicates)

  mydesign2<-model.matrix(~0+Group)
  colnames(mydesign2) <- levels(Group)

	D2 <- estimateGLMCommonDisp(D, mydesign2)
	D2 <- estimateGLMTrendedDisp(D2, mydesign2)

	fit2<-glmFit(D2, mydesign2)
	> colnames(fit2)
	 [1] "H.High.F"    "H.High.G"    "H.High.P"    "H.Low.F"     "H.Low.G"    
	 [6] "H.Low.P"     "H.Medium.F"  "H.Medium.G"  "H.Medium.P"  "PH.High.F"  
	[11] "PH.High.G"   "PH.High.P"   "PH.Low.F"    "PH.Low.G"    "PH.Low.P"   
	[16] "PH.Medium.F" "PH.Medium.G" "PH.Medium.P"

Now, to understand the differences between the two models, my question was this:
if I define the contrast:

  contrast1<-c(rep(0,14), 1, 0, -1, 0) # "GenotypeG:LoadLow:TimePH" vs "GenotypeG:LoadMedium:TimePH"         

on the first model and I make a LR test, will I obtain the same result as making a LR test on the contrast:

  contrast2<-makeContrasts(G.LowvsMedium.PH = PH.Low.G - PH.Medium.G, levels=mydesign2)  

on the second model?
This is the code I used to make the tests:

  lrt1.G.LowvsMedium.PH<-glmLRT(fit1, contrast=contrast1)
  summary(decideTestsDGE(lrt1.G.LowvsMedium.PH))

and

  lrt2.G.LowvsMedium.PH<-glmLRT(fit2, contrast=contrast2[," G.LowvsMedium.PH"])
  summary(decideTestsDGE(lrt2.G.LowvsMedium.PH))

The answer to the question is no, because I've tried to do it on my data and the results are different between the two tests. But why? Am I doing two completely different tests? The two parameters of the two models aren't saying me the same thing? 

In a moment of madness :) I tried to estimate another model, with only the three-level interactions between the factors.

Model 3)
  mydesign3<-model.matrix(~0+Genotype:Load:Time, data=design)

  D3 <- estimateGLMCommonDisp(D, mydesign3)
  D3 <- estimateGLMTrendedDisp(D3, mydesign3)

  fit3<-glmFit(D3, mydesign3)
  > colnames(fit3)
  [1] "GenotypeF:LoadHigh:TimeH"    "GenotypeG:LoadHigh:TimeH"   
  [3] "GenotypeP:LoadHigh:TimeH"    "GenotypeF:LoadLow:TimeH"    
  [5] "GenotypeG:LoadLow:TimeH"     "GenotypeP:LoadLow:TimeH"    
  [7] "GenotypeF:LoadMedium:TimeH"  "GenotypeG:LoadMedium:TimeH" 
  [9] "GenotypeP:LoadMedium:TimeH"  "GenotypeF:LoadHigh:TimePH"  
  [11] "GenotypeG:LoadHigh:TimePH"   "GenotypeP:LoadHigh:TimePH"  
  [13] "GenotypeF:LoadLow:TimePH"    "GenotypeG:LoadLow:TimePH"   
  [15] "GenotypeP:LoadLow:TimePH"    "GenotypeF:LoadMedium:TimePH"
  [17] "GenotypeG:LoadMedium:TimePH" "GenotypeP:LoadMedium:TimePH"

Well, this model gave me the same parameter estimates as the model 2 (each treatment combination as a group) and, obviously, the same results in terms of DEG. But is it statistically correct to estimate a model with three-level interactions without any principal effect?

Thanks in advance for any replies!

Marco

 -- output of sessionInfo(): 

R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Italian_Italy.1252  LC_CTYPE=Italian_Italy.1252   
[3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.1252    

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] EDASeq_1.8.0         ShortRead_1.20.0     Rsamtools_1.14.2    
 [4] lattice_0.20-24      Biostrings_2.30.1    GenomicRanges_1.14.4
 [7] XVector_0.2.0        IRanges_1.20.6       Biobase_2.22.0      
[10] BiocGenerics_0.8.0   edgeR_3.4.2          limma_3.18.9        
[13] aroma.light_1.32.0   matrixStats_0.8.14   yeastRNASeq_0.0.10  

loaded via a namespace (and not attached):
 [1] annotate_1.40.0      AnnotationDbi_1.24.0 bitops_1.0-6        
 [4] DBI_0.2-7            DESeq_1.14.0         genefilter_1.44.0   
 [7] geneplotter_1.40.0   grid_3.0.2           hwriter_1.3         
[10] latticeExtra_0.6-26  R.methodsS3_1.6.1    R.oo_1.17.0         
[13] RColorBrewer_1.0-5   RSQLite_0.11.4       stats4_3.0.2        
[16] survival_2.37-6      tools_3.0.2          XML_3.98-1.1        
[19] xtable_1.7-1         zlibbioc_1.8.0

--
Sent via the guest posting facility at bioconductor.org.