[BioC] edgeR design matrix and contrast questions

Ryan C. Thompson rct at thompsonclan.org
Wed Mar 6 19:52:53 CET 2013


As long as your design formula contains the same factors (or products 
of factors), it will produce an equivalent design. The only difference 
is in the parametrization. The only time the order matters is if you 
are using a no-intercept formula, in which case the first factor is 
coded directly into the design, while the remaining factors are coded 
as contrasts. In a design with an intercept, all factors are coded as 
contrasts, and the order of factors in the formula is irrelevant.

On Wed 06 Mar 2013 10:37:12 AM PST, Findley Finseth wrote:
> Hi Ryan --
>
> Thanks so much for your helpful response.
>
> One more clarification - apologies, I am new to contrasts.  I
> originally set up my matrix as you suggested, but was concerned that
> the order of the factors mattered in the formula (as I know it
> sometimes can).  In other words, to properly reflect that my
> experiment is paired, I thought "subject" would need to be first
> (which is why I did my funky contrasts).  As I understand from your
> response, this is not a concern and subject is fine to list second.
>
> Thanks again,
> Findley
>
>
> On Mar 6, 2013, at 10:15 AM, Ryan Thompson wrote:
>
>> I would recommend creating a design matrix with no intercept term and
>> putting the group factor first in the formula: ~0 + group + subject.
>> This will yield one coefficient for each tissue. Also, if you use
>> sum-to-zero contrasts for subject, then I believe the tissue
>> coefficients are directly interpretable as fitted cpm values
>>
>> On Mar 5, 2013 11:16 PM, "Findley Finseth" <findleyransler at gmail.com
>> <mailto:findleyransler at gmail.com>> wrote:
>>
>>     Hello all --
>>
>>
>>     I am using edgeR to analyze RNASeq data.  Thank you so much for
>>     this software and the clear and straightforward user guide.  I
>>     have a question regarding contrasts using the glm approach for a
>>     paired experimental design.
>>
>>     I have three tissues (PG, Liv, and Test) from six subjects (1-6).
>>       One of my goals is to get a set of genes that are "enriched"
>>     for each tissue.  To this end, I would like to compare each
>>     tissue to the average of the other two (e.g., PG - (liver +
>>     testes)/2), while accounting for the fact that my experiment is
>>     paired by subject.   I understand how to build an appropriate
>>     design matrix and call contrasts to either compare each tissue to
>>     the average of the other two (e.g., edgeR section 3.4.3-4) or
>>     perform pairwise comparisons between tissues while including
>>     subject (e.g., edgeR section 3.4.1) .  However, I start
>>     struggling when I try to combine these goals and compare one
>>     tissue to the average of the other two while pairing by subject.
>>       I have also considered the multi-level approach (e.g., limma
>>     section 8.7) but am not doing both independent and paired
>>     comparisons, so did not think that was an appropriate avenue.
>>
>>     I am pasting my session output with preliminary code below.
>>      Specifically, I could use advice on whether 1) my design matrix
>>     and 2) the way I have called my contrasts are appropriate for my
>>     question.
>>
>>
>>     On a slightly different note, I am getting the following error
>>     when using topTags (also on session output below):
>>
>>     Error in abs(object$table$logFC) :
>>       Non-numeric argument to mathematical function
>>
>>      I noticed there was a previous post in December about similarly
>>     passing a single contrast to a glmLRT and retrieving the same
>>     error, which was corrected in edgeR_3.0.7.  As I am using 3.0.8,
>>     I am assuming the error has more to do with how I have set
>>     up/called my contrasts, but thought I would mention it.
>>
>>
>>
>>
>>     Thank you in advance,
>>
>>     Findley Finseth
>>
>>
>>     PhD Candidate
>>
>>     Dept Ecology and Evolutionary Biology
>>     Cornell University
>>     Ithaca, NY 14850
>>
>>
>>
>>
>>
>>     > library(edgeR)
>>     Loading required package: limma
>>     >
>>     > # making a matrix of factors called "targets", Tissue = tissue,
>>     Ind2 = subject
>>     > targets <-readTargets()
>>     > targets
>>                    X Tissue Ind2
>>     1  JQ_112_Liv_TG    Liv    1
>>     2  JQ_122-2_PG_A     PG    2
>>     3  JQ_179_Liv_AC    Liv    3
>>     4  JQ_191_Test_C   Test    4
>>     5  JQ_255_PG_AGT     PG    5
>>     6  JQ_107_Liv_CC    Liv    6
>>     7  JQ_107_PG_AGT     PG    6
>>     8  JQ_122-2_Test   Test    2
>>     9  JQ_191_Liv_AT    Liv    4
>>     10 JQ_107_Test_G   Test    6
>>     11 JQ_112_PG_CGA     PG    1
>>     12 JQ_112_Test_A   Test    1
>>     13 JQ_122-2_Liv_    Liv    2
>>     14 JQ_191_PG_AGT     PG    4
>>     15 JQ_179_PG_AGT     PG    3
>>     16 JQ_179_Test_A   Test    3
>>     17 JQ_255_Liv_GA    Liv    5
>>     18 JQ_255_Test_G   Test    5
>>     >
>>     > # importing raw data
>>     > rawdata <- read.delim("controlonly_practiceEdgeR.txt")
>>     > head(rawdata)
>>                      Gene sample2 sample6 sample7 sample9 sample10
>>     sample12 sample13 sample17
>>     1  comp326924_c0_seq1      23       7       3       6        8
>>         16        5        6
>>     2  comp434050_c0_seq1       2       0       0      26        0
>>          0        0       34
>>     3   comp28996_c0_seq1     344     161     191     284      144
>>        354      114      338
>>     4 comp1083897_c0_seq1       1       4       1       0        1
>>          2        0        0
>>     5  comp544783_c0_seq1       3       0       0       4        0
>>          0        0       11
>>     6  comp654539_c0_seq1       0       0       0       9        0
>>          1        0       11
>>       sample20 sample22 sample24 sample25 sample29 sample30 sample36
>>     sample37 sample39
>>     1        6       20        4        8       10        2        1
>>            4       32
>>     2        0       49        0       10        0        0        0
>>           19        1
>>     3      154      815      196      163      252       51       74
>>          168      352
>>     4        0        1        4        0        0        2        0
>>            0        4
>>     5        1       19        0        2        2        0        0
>>            3        0
>>     6        1        4        0        5        0        0        0
>>            3        0
>>       sample40
>>     1        2
>>     2       18
>>     3      182
>>     4        0
>>     5        2
>>     6        3
>>     >
>>     > # setting groups equal to tissue differences
>>     > group <- targets$Tissue
>>     > group <- as.factor(group)
>>     > group
>>      [1] Liv  PG   Liv  Test PG   Liv  PG   Test Liv  Test PG   Test
>>     Liv  PG   PG   Test Liv
>>     [18] Test
>>     Levels: Liv PG Test
>>     >
>>     >
>>     > # making my DGE list
>>     > y <- DGEList(counts=rawdata[,2:19],genes=rawdata[,1],group=group)
>>     Calculating library sizes from column totals.
>>     >
>>     > # filtering out lowly expressed genes; since the smallest group
>>     size is six, we keeping genes with at least one count per million
>>     in at least six samples
>>     > keep <- rowSums(cpm(y)>1) >= 6
>>     > y <- y[keep,]
>>     > dim(y)  # retained 30,140 genes
>>     [1] 30140    18
>>     >
>>     >
>>     > #recomputing the library sizes:
>>     > y$samples$lib.size <- colSums(y$counts)
>>     >
>>     > #calculating normalization factors
>>     > y <- calcNormFactors(y)
>>     > y$samples # looks good
>>              group lib.size norm.factors
>>     sample2    Liv 13003017    1.2319091
>>     sample6     PG 12388919    0.7524641
>>     sample7    Liv  8045685    0.6490458
>>     sample9   Test  9164420    1.5566202
>>     sample10    PG 15292051    0.5119163
>>     sample12   Liv 13042773    0.8596694
>>     sample13    PG 11862830    0.6419963
>>     sample17  Test  9472129    1.6400614
>>     sample20   Liv  6514347    0.8206096
>>     sample22  Test 25926605    1.4096284
>>     sample24    PG 12334446    0.7888070
>>     sample25  Test  6283137    1.6230035
>>     sample29   Liv 19970487    0.8060990
>>     sample30    PG  6159314    0.7825644
>>     sample36    PG  3897369    0.9081739
>>     sample37  Test  5778563    1.6550333
>>     sample39   Liv 15619497    0.9618765
>>     sample40  Test  5076345    1.7061590
>>     >
>>     >
>>     >
>>     > # setting subject equal to individuals to reflect fact that
>>     samples are paired
>>     > subject <- factor(targets$Ind2)
>>     >
>>     > ############### building design matrix; could use advice here
>>     regarding appropriateness##########
>>     > design4 <- model.matrix(~subject + group)
>>     > design4
>>        (Intercept) subject2 subject3 subject4 subject5 subject6
>>     groupPG groupTest
>>     1            1        0        0        0        0        0
>>     0         0
>>     2            1        1        0        0        0        0
>>     1         0
>>     3            1        0        1        0        0        0
>>     0         0
>>     4            1        0        0        1        0        0
>>     0         1
>>     5            1        0        0        0        1        0
>>     1         0
>>     6            1        0        0        0        0        1
>>     0         0
>>     7            1        0        0        0        0        1
>>     1         0
>>     8            1        1        0        0        0        0
>>     0         1
>>     9            1        0        0        1        0        0
>>     0         0
>>     10           1        0        0        0        0        1
>>     0         1
>>     11           1        0        0        0        0        0
>>     1         0
>>     12           1        0        0        0        0        0
>>     0         1
>>     13           1        1        0        0        0        0
>>     0         0
>>     14           1        0        0        1        0        0
>>     1         0
>>     15           1        0        1        0        0        0
>>     1         0
>>     16           1        0        1        0        0        0
>>     0         1
>>     17           1        0        0        0        1        0
>>     0         0
>>     18           1        0        0        0        1        0
>>     0         1
>>     attr(,"assign")
>>     [1] 0 1 1 1 1 1 2 2
>>     attr(,"contrasts")
>>     attr(,"contrasts")$subject
>>     [1] "contr.treatment"
>>
>>     attr(,"contrasts")$group
>>     [1] "contr.treatment"
>>
>>     >
>>     > # estimating dispersions
>>     > y4 <- estimateGLMCommonDisp(y,design4)
>>     > y4 <- estimateGLMTrendedDisp(y4,design4)
>>     Loading required package: splines
>>     > y4 <- estimateGLMTagwiseDisp(y4,design4)
>>     >
>>     > # building contrasts
>>     >
>>     > ##############This is where things start to get a bit sticky
>>     for me; I am making some preliminary contrasts below, but am
>>     unsure of whether I've treated the intercept appropriately.  I
>>     think it refers to groupLiv, though I have some uncertainty about
>>     this, given that subject is the first term in the model##########
>>     >
>>     > #########ultimately, I  would like to compare (PG -
>>     (Liv+Test)/2) and so on for each tissue type###########
>>     >
>>     > my.contrast <- makeContrasts(
>>     +  Liv=(Intercept)-(groupPG+groupTest)/2,
>>     +  PG=groupPG-((Intercept)+groupTest)/2,
>>     +  Test=groupTest-(groupPG+(Intercept))/2,
>>     + levels=design4)
>>     Warning message:
>>     In makeContrasts(Liv = (Intercept) - (groupPG + groupTest)/2, PG
>>     = groupPG -  :
>>       Renaming (Intercept) to Intercept
>>     >
>>     >
>>     > # fitting a glm
>>     > fit <- glmFit(y4,design4)
>>     >
>>     >
>>     > # likelihood ratio tests
>>     >
>>     > lrt.Liv <- glmLRT(fit, my.contrast[,"Liv"])
>>     > lrt.PG <- glmLRT(fit, my.contrast[,"PG"])
>>     > lrt.Testes <- glmLRT(fit, my.contrast[,"Test"])
>>     >
>>     > #finding top tags;
>>     > topTags(lrt.Liv)
>>     Error in abs(object$table$logFC) :
>>       Non-numeric argument to mathematical function
>>     > topTags(lrt.PG)
>>     Error in abs(object$table$logFC) :
>>       Non-numeric argument to mathematical function
>>     > topTags(lrt.Testes)
>>     Error in abs(object$table$logFC) :
>>       Non-numeric argument to mathematical function
>>     >
>>     >
>>     > #### Note: I am also getting an error after using topTags about
>>     passing a non-numeric argument to a mathermatical function.  I
>>     noticed there was a previous post about similarly passing a
>>     single contrast to a glmLRT and retrieving the same error, which
>>     was corrected in edgeR_3.0.7. As I am using 3.0.8, I am assuming
>>     the error has more to do with how I have set up/called my
>>     contrasts, but thought I would mention it.
>>     >
>>     > sessionInfo()
>>     R version 2.15.2 (2012-10-26)
>>     Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>>     locale:
>>     [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>>     attached base packages:
>>     [1] splines   stats     graphics  grDevices utils     datasets
>>      methods   base
>>
>>     other attached packages:
>>     [1] edgeR_3.0.8  limma_3.14.4
>>
>>     loaded via a namespace (and not attached):
>>     [1] tools_2.15.2
>>
>>     _______________________________________________
>>     Bioconductor mailing list
>>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>>     Search the archives:
>>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>



More information about the Bioconductor mailing list