[BioC] edgeR design matrix and contrast questions

Thu Mar 7 02:59:23 CET 2013

I just noticed a typo:  To test PG-(Liv+Test)/2 you need the contrast 
groupPG-groupTest/2, rather than what I wrote below.

Gordon

On Thu, 7 Mar 2013, Gordon K Smyth wrote:

> Dear Findley,
>
> The fact that you have paired samples actually doesn't change the way that 
> contrasts are formed in edgeR.  Assuming you use the glm code, the same 
> contrast code works as if you had no pairing.
>
> The design matrix is fine:
>
>  design4 <- model.matrix(~subject + group)
>
> however your contrasts are incorrect.  One never needs to include the 
> intercept term in a contrast.
>
> The coefficients of interest in your model are called groupPG and groupTest. 
> Both of these are relative to the Liv group, so you should think of groupPG 
> as representing PG-Liv and groupTest as representing Test-Liv.
>
> Hence to test Liv-(PG+Test)/2 you want the contrast -(groupPG-groupTest)/2. 
> To test PG-(Liv+Test)/2 you want the contrast groupPG-groupLiv/2.  To test 
> Test-(PG+Liv)/2 you want the contrast groupTest-groupPG/2.
>
> Best wishes
> Gordon
>
>
>> Date: Tue, 5 Mar 2013 15:21:18 -0500
>> From: Findley Finseth <findleyransler at gmail.com>
>> To: Bioconductor mailing list <bioconductor at r-project.org>
>> Subject: [BioC] edgeR design matrix and contrast questions
>> 
>> Hello all --
>> 
>> 
>> I am using edgeR to analyze RNASeq data.  Thank you so much for this 
>> software and the clear and straightforward user guide.  I have a question 
>> regarding contrasts using the glm approach for a paired experimental 
>> design.
>> 
>> I have three tissues (PG, Liv, and Test) from six subjects (1-6).  One of 
>> my goals is to get a set of genes that are "enriched" for each tissue.  To 
>> this end, I would like to compare each tissue to the average of the other 
>> two (e.g., PG - (liver + testes)/2), while accounting for the fact that my 
>> experiment is paired by subject.  I understand how to build an appropriate 
>> design matrix and call contrasts to either compare each tissue to the 
>> average of the other two (e.g., edgeR section 3.4.3-4) or perform pairwise 
>> comparisons between tissues while including subject (e.g., edgeR section 
>> 3.4.1) .  However, I start struggling when I try to combine these goals and 
>> compare one tissue to the average of the other two while pairing by 
>> subject.  I have also considered the multi-level approach (e.g., limma 
>> section 8.7) but am not doing both independent and paired comparisons, so 
>> did not think that was an appropriate avenue.
>> 
>> I am pasting my session output with preliminary code below. Specifically, I 
>> could use advice on whether 1) my design matrix and 2) the way I have 
>> called my contrasts are appropriate for my question.
>> 
>> 
>> On a slightly different note, I am getting the following error when using 
>> topTags (also on session output below):
>> 
>> Error in abs(object$table$logFC) :
>>  Non-numeric argument to mathematical function
>> 
>> I noticed there was a previous post in December about similarly passing a 
>> single contrast to a glmLRT and retrieving the same error, which was 
>> corrected in edgeR_3.0.7.  As I am using 3.0.8, I am assuming the error has 
>> more to do with how I have set up/called my contrasts, but thought I would 
>> mention it.
>> 
>> 
>> 
>> 
>> Thank you in advance,
>> 
>> Findley Finseth
>> 
>> 
>> PhD Candidate
>> 
>> Dept Ecology and Evolutionary Biology
>> Cornell University
>> Ithaca, NY 14850
>> 
>> 
>> 
>> 
>> 
>>> library(edgeR)
>> Loading required package: limma
>>> 
>>> # making a matrix of factors called "targets", Tissue = tissue, Ind2 = 
>>> subject
>>> targets <-readTargets()
>>> targets
>>               X Tissue Ind2
>> 1  JQ_112_Liv_TG    Liv    1
>> 2  JQ_122-2_PG_A     PG    2
>> 3  JQ_179_Liv_AC    Liv    3
>> 4  JQ_191_Test_C   Test    4
>> 5  JQ_255_PG_AGT     PG    5
>> 6  JQ_107_Liv_CC    Liv    6
>> 7  JQ_107_PG_AGT     PG    6
>> 8  JQ_122-2_Test   Test    2
>> 9  JQ_191_Liv_AT    Liv    4
>> 10 JQ_107_Test_G   Test    6
>> 11 JQ_112_PG_CGA     PG    1
>> 12 JQ_112_Test_A   Test    1
>> 13 JQ_122-2_Liv_    Liv    2
>> 14 JQ_191_PG_AGT     PG    4
>> 15 JQ_179_PG_AGT     PG    3
>> 16 JQ_179_Test_A   Test    3
>> 17 JQ_255_Liv_GA    Liv    5
>> 18 JQ_255_Test_G   Test    5
>>> 
>>> # importing raw data
>>> rawdata <- read.delim("controlonly_practiceEdgeR.txt")
>>> head(rawdata)
>>                 Gene sample2 sample6 sample7 sample9 sample10 sample12 
>> sample13 sample17
>> 1  comp326924_c0_seq1      23       7       3       6        8       16 
>> 5        6
>> 2  comp434050_c0_seq1       2       0       0      26        0        0 
>> 0       34
>> 3   comp28996_c0_seq1     344     161     191     284      144      354 
>> 114      338
>> 4 comp1083897_c0_seq1       1       4       1       0        1        2 
>> 0        0
>> 5  comp544783_c0_seq1       3       0       0       4        0        0 
>> 0       11
>> 6  comp654539_c0_seq1       0       0       0       9        0        1 
>> 0       11
>>  sample20 sample22 sample24 sample25 sample29 sample30 sample36 sample37 
>> sample39
>> 1        6       20        4        8       10        2        1        4 
>> 32
>> 2        0       49        0       10        0        0        0       19 
>> 1
>> 3      154      815      196      163      252       51       74      168 
>> 352
>> 4        0        1        4        0        0        2        0        0 
>> 4
>> 5        1       19        0        2        2        0        0        3 
>> 0
>> 6        1        4        0        5        0        0        0        3 
>> 0
>>  sample40
>> 1        2
>> 2       18
>> 3      182
>> 4        0
>> 5        2
>> 6        3
>>> 
>>> # setting groups equal to tissue differences
>>> group <- targets$Tissue
>>> group <- as.factor(group)
>>> group
>> [1] Liv  PG   Liv  Test PG   Liv  PG   Test Liv  Test PG   Test Liv  PG 
>> PG   Test Liv
>> [18] Test
>> Levels: Liv PG Test
>>> 
>>> 
>>> # making my DGE list
>>> y <- DGEList(counts=rawdata[,2:19],genes=rawdata[,1],group=group)
>> Calculating library sizes from column totals.
>>> 
>>> # filtering out lowly expressed genes; since the smallest group size is 
>>> six, we keeping genes with at least one count per million in at least six 
>>> samples
>>> keep <- rowSums(cpm(y)>1) >= 6
>>> y <- y[keep,]
>>> dim(y)  # retained 30,140 genes
>> [1] 30140    18
>>> 
>>> 
>>> #recomputing the library sizes:
>>> y$samples$lib.size <- colSums(y$counts)
>>> 
>>> #calculating normalization factors
>>> y <- calcNormFactors(y)
>>> y$samples # looks good
>>         group lib.size norm.factors
>> sample2    Liv 13003017    1.2319091
>> sample6     PG 12388919    0.7524641
>> sample7    Liv  8045685    0.6490458
>> sample9   Test  9164420    1.5566202
>> sample10    PG 15292051    0.5119163
>> sample12   Liv 13042773    0.8596694
>> sample13    PG 11862830    0.6419963
>> sample17  Test  9472129    1.6400614
>> sample20   Liv  6514347    0.8206096
>> sample22  Test 25926605    1.4096284
>> sample24    PG 12334446    0.7888070
>> sample25  Test  6283137    1.6230035
>> sample29   Liv 19970487    0.8060990
>> sample30    PG  6159314    0.7825644
>> sample36    PG  3897369    0.9081739
>> sample37  Test  5778563    1.6550333
>> sample39   Liv 15619497    0.9618765
>> sample40  Test  5076345    1.7061590
>>> 
>>> 
>>> 
>>> # setting subject equal to individuals to reflect fact that samples are 
>>> paired
>>> subject <- factor(targets$Ind2)
>>> 
>>> ############### building design matrix; could use advice here regarding 
>>> appropriateness##########
>>> design4 <- model.matrix(~subject + group)
>>> design4
>>   (Intercept) subject2 subject3 subject4 subject5 subject6 groupPG 
>> groupTest
>> 1            1        0        0        0        0        0       0 
>> 0
>> 2            1        1        0        0        0        0       1 
>> 0
>> 3            1        0        1        0        0        0       0 
>> 0
>> 4            1        0        0        1        0        0       0 
>> 1
>> 5            1        0        0        0        1        0       1 
>> 0
>> 6            1        0        0        0        0        1       0 
>> 0
>> 7            1        0        0        0        0        1       1 
>> 0
>> 8            1        1        0        0        0        0       0 
>> 1
>> 9            1        0        0        1        0        0       0 
>> 0
>> 10           1        0        0        0        0        1       0 
>> 1
>> 11           1        0        0        0        0        0       1 
>> 0
>> 12           1        0        0        0        0        0       0 
>> 1
>> 13           1        1        0        0        0        0       0 
>> 0
>> 14           1        0        0        1        0        0       1 
>> 0
>> 15           1        0        1        0        0        0       1 
>> 0
>> 16           1        0        1        0        0        0       0 
>> 1
>> 17           1        0        0        0        1        0       0 
>> 0
>> 18           1        0        0        0        1        0       0 
>> 1
>> attr(,"assign")
>> [1] 0 1 1 1 1 1 2 2
>> attr(,"contrasts")
>> attr(,"contrasts")$subject
>> [1] "contr.treatment"
>> 
>> attr(,"contrasts")$group
>> [1] "contr.treatment"
>> 
>>> 
>>> # estimating dispersions
>>> y4 <- estimateGLMCommonDisp(y,design4)
>>> y4 <- estimateGLMTrendedDisp(y4,design4)
>> Loading required package: splines
>>> y4 <- estimateGLMTagwiseDisp(y4,design4)
>>> 
>>> # building contrasts
>>> 
>>> ##############This is where things start to get a bit sticky for me; I am 
>>> making some preliminary contrasts below, but am unsure of whether I've 
>>> treated the intercept appropriately.  I think it refers to groupLiv, 
>>> though I have some uncertainty about this, given that subject is the first 
>>> term in the model##########
>>> 
>>> #########ultimately, I  would like to compare (PG - (Liv+Test)/2) and so 
>>> on for each tissue type###########
>>> 
>>> my.contrast <- makeContrasts(
>> +  Liv=(Intercept)-(groupPG+groupTest)/2,
>> +  PG=groupPG-((Intercept)+groupTest)/2,
>> +  Test=groupTest-(groupPG+(Intercept))/2,
>> + levels=design4)
>> Warning message:
>> In makeContrasts(Liv = (Intercept) - (groupPG + groupTest)/2, PG = groupPG 
>> -  :
>>  Renaming (Intercept) to Intercept
>>> 
>>> 
>>> # fitting a glm
>>> fit <- glmFit(y4,design4)
>>> 
>>> 
>>> # likelihood ratio tests
>>> 
>>> lrt.Liv <- glmLRT(fit, my.contrast[,"Liv"])
>>> lrt.PG <- glmLRT(fit, my.contrast[,"PG"])
>>> lrt.Testes <- glmLRT(fit, my.contrast[,"Test"])
>>> 
>>> #finding top tags;
>>> topTags(lrt.Liv)
>> Error in abs(object$table$logFC) :
>>  Non-numeric argument to mathematical function
>>> topTags(lrt.PG)
>> Error in abs(object$table$logFC) :
>>  Non-numeric argument to mathematical function
>>> topTags(lrt.Testes)
>> Error in abs(object$table$logFC) :
>>  Non-numeric argument to mathematical function
>>> 
>>> 
>>> #### Note: I am also getting an error after using topTags about passing a 
>>> non-numeric argument to a mathermatical function.  I noticed there was a 
>>> previous post about similarly passing a single contrast to a glmLRT and 
>>> retrieving the same error, which was corrected in edgeR_3.0.7. As I am 
>>> using 3.0.8, I am assuming the error has more to do with how I have set 
>>> up/called my contrasts, but thought I would mention it.
>>> 
>>> sessionInfo()
>> R version 2.15.2 (2012-10-26)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>> 
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>> 
>> attached base packages:
>> [1] splines   stats     graphics  grDevices utils     datasets  methods 
>> base
>> 
>> other attached packages:
>> [1] edgeR_3.0.8  limma_3.14.4
>> 
>> loaded via a namespace (and not attached):
>> [1] tools_2.15.2
>> 
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}