[BioC] edgeR design matrix and contrast questions
Gordon K Smyth
smyth at wehi.EDU.AU
Thu Mar 7 02:59:23 CET 2013
I just noticed a typo: To test PG-(Liv+Test)/2 you need the contrast
groupPG-groupTest/2, rather than what I wrote below.
Gordon
On Thu, 7 Mar 2013, Gordon K Smyth wrote:
> Dear Findley,
>
> The fact that you have paired samples actually doesn't change the way that
> contrasts are formed in edgeR. Assuming you use the glm code, the same
> contrast code works as if you had no pairing.
>
> The design matrix is fine:
>
> design4 <- model.matrix(~subject + group)
>
> however your contrasts are incorrect. One never needs to include the
> intercept term in a contrast.
>
> The coefficients of interest in your model are called groupPG and groupTest.
> Both of these are relative to the Liv group, so you should think of groupPG
> as representing PG-Liv and groupTest as representing Test-Liv.
>
> Hence to test Liv-(PG+Test)/2 you want the contrast -(groupPG-groupTest)/2.
> To test PG-(Liv+Test)/2 you want the contrast groupPG-groupLiv/2. To test
> Test-(PG+Liv)/2 you want the contrast groupTest-groupPG/2.
>
> Best wishes
> Gordon
>
>
>> Date: Tue, 5 Mar 2013 15:21:18 -0500
>> From: Findley Finseth <findleyransler at gmail.com>
>> To: Bioconductor mailing list <bioconductor at r-project.org>
>> Subject: [BioC] edgeR design matrix and contrast questions
>>
>> Hello all --
>>
>>
>> I am using edgeR to analyze RNASeq data. Thank you so much for this
>> software and the clear and straightforward user guide. I have a question
>> regarding contrasts using the glm approach for a paired experimental
>> design.
>>
>> I have three tissues (PG, Liv, and Test) from six subjects (1-6). One of
>> my goals is to get a set of genes that are "enriched" for each tissue. To
>> this end, I would like to compare each tissue to the average of the other
>> two (e.g., PG - (liver + testes)/2), while accounting for the fact that my
>> experiment is paired by subject. I understand how to build an appropriate
>> design matrix and call contrasts to either compare each tissue to the
>> average of the other two (e.g., edgeR section 3.4.3-4) or perform pairwise
>> comparisons between tissues while including subject (e.g., edgeR section
>> 3.4.1) . However, I start struggling when I try to combine these goals and
>> compare one tissue to the average of the other two while pairing by
>> subject. I have also considered the multi-level approach (e.g., limma
>> section 8.7) but am not doing both independent and paired comparisons, so
>> did not think that was an appropriate avenue.
>>
>> I am pasting my session output with preliminary code below. Specifically, I
>> could use advice on whether 1) my design matrix and 2) the way I have
>> called my contrasts are appropriate for my question.
>>
>>
>> On a slightly different note, I am getting the following error when using
>> topTags (also on session output below):
>>
>> Error in abs(object$table$logFC) :
>> Non-numeric argument to mathematical function
>>
>> I noticed there was a previous post in December about similarly passing a
>> single contrast to a glmLRT and retrieving the same error, which was
>> corrected in edgeR_3.0.7. As I am using 3.0.8, I am assuming the error has
>> more to do with how I have set up/called my contrasts, but thought I would
>> mention it.
>>
>>
>>
>>
>> Thank you in advance,
>>
>> Findley Finseth
>>
>>
>> PhD Candidate
>>
>> Dept Ecology and Evolutionary Biology
>> Cornell University
>> Ithaca, NY 14850
>>
>>
>>
>>
>>
>>> library(edgeR)
>> Loading required package: limma
>>>
>>> # making a matrix of factors called "targets", Tissue = tissue, Ind2 =
>>> subject
>>> targets <-readTargets()
>>> targets
>> X Tissue Ind2
>> 1 JQ_112_Liv_TG Liv 1
>> 2 JQ_122-2_PG_A PG 2
>> 3 JQ_179_Liv_AC Liv 3
>> 4 JQ_191_Test_C Test 4
>> 5 JQ_255_PG_AGT PG 5
>> 6 JQ_107_Liv_CC Liv 6
>> 7 JQ_107_PG_AGT PG 6
>> 8 JQ_122-2_Test Test 2
>> 9 JQ_191_Liv_AT Liv 4
>> 10 JQ_107_Test_G Test 6
>> 11 JQ_112_PG_CGA PG 1
>> 12 JQ_112_Test_A Test 1
>> 13 JQ_122-2_Liv_ Liv 2
>> 14 JQ_191_PG_AGT PG 4
>> 15 JQ_179_PG_AGT PG 3
>> 16 JQ_179_Test_A Test 3
>> 17 JQ_255_Liv_GA Liv 5
>> 18 JQ_255_Test_G Test 5
>>>
>>> # importing raw data
>>> rawdata <- read.delim("controlonly_practiceEdgeR.txt")
>>> head(rawdata)
>> Gene sample2 sample6 sample7 sample9 sample10 sample12
>> sample13 sample17
>> 1 comp326924_c0_seq1 23 7 3 6 8 16
>> 5 6
>> 2 comp434050_c0_seq1 2 0 0 26 0 0
>> 0 34
>> 3 comp28996_c0_seq1 344 161 191 284 144 354
>> 114 338
>> 4 comp1083897_c0_seq1 1 4 1 0 1 2
>> 0 0
>> 5 comp544783_c0_seq1 3 0 0 4 0 0
>> 0 11
>> 6 comp654539_c0_seq1 0 0 0 9 0 1
>> 0 11
>> sample20 sample22 sample24 sample25 sample29 sample30 sample36 sample37
>> sample39
>> 1 6 20 4 8 10 2 1 4
>> 32
>> 2 0 49 0 10 0 0 0 19
>> 1
>> 3 154 815 196 163 252 51 74 168
>> 352
>> 4 0 1 4 0 0 2 0 0
>> 4
>> 5 1 19 0 2 2 0 0 3
>> 0
>> 6 1 4 0 5 0 0 0 3
>> 0
>> sample40
>> 1 2
>> 2 18
>> 3 182
>> 4 0
>> 5 2
>> 6 3
>>>
>>> # setting groups equal to tissue differences
>>> group <- targets$Tissue
>>> group <- as.factor(group)
>>> group
>> [1] Liv PG Liv Test PG Liv PG Test Liv Test PG Test Liv PG
>> PG Test Liv
>> [18] Test
>> Levels: Liv PG Test
>>>
>>>
>>> # making my DGE list
>>> y <- DGEList(counts=rawdata[,2:19],genes=rawdata[,1],group=group)
>> Calculating library sizes from column totals.
>>>
>>> # filtering out lowly expressed genes; since the smallest group size is
>>> six, we keeping genes with at least one count per million in at least six
>>> samples
>>> keep <- rowSums(cpm(y)>1) >= 6
>>> y <- y[keep,]
>>> dim(y) # retained 30,140 genes
>> [1] 30140 18
>>>
>>>
>>> #recomputing the library sizes:
>>> y$samples$lib.size <- colSums(y$counts)
>>>
>>> #calculating normalization factors
>>> y <- calcNormFactors(y)
>>> y$samples # looks good
>> group lib.size norm.factors
>> sample2 Liv 13003017 1.2319091
>> sample6 PG 12388919 0.7524641
>> sample7 Liv 8045685 0.6490458
>> sample9 Test 9164420 1.5566202
>> sample10 PG 15292051 0.5119163
>> sample12 Liv 13042773 0.8596694
>> sample13 PG 11862830 0.6419963
>> sample17 Test 9472129 1.6400614
>> sample20 Liv 6514347 0.8206096
>> sample22 Test 25926605 1.4096284
>> sample24 PG 12334446 0.7888070
>> sample25 Test 6283137 1.6230035
>> sample29 Liv 19970487 0.8060990
>> sample30 PG 6159314 0.7825644
>> sample36 PG 3897369 0.9081739
>> sample37 Test 5778563 1.6550333
>> sample39 Liv 15619497 0.9618765
>> sample40 Test 5076345 1.7061590
>>>
>>>
>>>
>>> # setting subject equal to individuals to reflect fact that samples are
>>> paired
>>> subject <- factor(targets$Ind2)
>>>
>>> ############### building design matrix; could use advice here regarding
>>> appropriateness##########
>>> design4 <- model.matrix(~subject + group)
>>> design4
>> (Intercept) subject2 subject3 subject4 subject5 subject6 groupPG
>> groupTest
>> 1 1 0 0 0 0 0 0
>> 0
>> 2 1 1 0 0 0 0 1
>> 0
>> 3 1 0 1 0 0 0 0
>> 0
>> 4 1 0 0 1 0 0 0
>> 1
>> 5 1 0 0 0 1 0 1
>> 0
>> 6 1 0 0 0 0 1 0
>> 0
>> 7 1 0 0 0 0 1 1
>> 0
>> 8 1 1 0 0 0 0 0
>> 1
>> 9 1 0 0 1 0 0 0
>> 0
>> 10 1 0 0 0 0 1 0
>> 1
>> 11 1 0 0 0 0 0 1
>> 0
>> 12 1 0 0 0 0 0 0
>> 1
>> 13 1 1 0 0 0 0 0
>> 0
>> 14 1 0 0 1 0 0 1
>> 0
>> 15 1 0 1 0 0 0 1
>> 0
>> 16 1 0 1 0 0 0 0
>> 1
>> 17 1 0 0 0 1 0 0
>> 0
>> 18 1 0 0 0 1 0 0
>> 1
>> attr(,"assign")
>> [1] 0 1 1 1 1 1 2 2
>> attr(,"contrasts")
>> attr(,"contrasts")$subject
>> [1] "contr.treatment"
>>
>> attr(,"contrasts")$group
>> [1] "contr.treatment"
>>
>>>
>>> # estimating dispersions
>>> y4 <- estimateGLMCommonDisp(y,design4)
>>> y4 <- estimateGLMTrendedDisp(y4,design4)
>> Loading required package: splines
>>> y4 <- estimateGLMTagwiseDisp(y4,design4)
>>>
>>> # building contrasts
>>>
>>> ##############This is where things start to get a bit sticky for me; I am
>>> making some preliminary contrasts below, but am unsure of whether I've
>>> treated the intercept appropriately. I think it refers to groupLiv,
>>> though I have some uncertainty about this, given that subject is the first
>>> term in the model##########
>>>
>>> #########ultimately, I would like to compare (PG - (Liv+Test)/2) and so
>>> on for each tissue type###########
>>>
>>> my.contrast <- makeContrasts(
>> + Liv=(Intercept)-(groupPG+groupTest)/2,
>> + PG=groupPG-((Intercept)+groupTest)/2,
>> + Test=groupTest-(groupPG+(Intercept))/2,
>> + levels=design4)
>> Warning message:
>> In makeContrasts(Liv = (Intercept) - (groupPG + groupTest)/2, PG = groupPG
>> - :
>> Renaming (Intercept) to Intercept
>>>
>>>
>>> # fitting a glm
>>> fit <- glmFit(y4,design4)
>>>
>>>
>>> # likelihood ratio tests
>>>
>>> lrt.Liv <- glmLRT(fit, my.contrast[,"Liv"])
>>> lrt.PG <- glmLRT(fit, my.contrast[,"PG"])
>>> lrt.Testes <- glmLRT(fit, my.contrast[,"Test"])
>>>
>>> #finding top tags;
>>> topTags(lrt.Liv)
>> Error in abs(object$table$logFC) :
>> Non-numeric argument to mathematical function
>>> topTags(lrt.PG)
>> Error in abs(object$table$logFC) :
>> Non-numeric argument to mathematical function
>>> topTags(lrt.Testes)
>> Error in abs(object$table$logFC) :
>> Non-numeric argument to mathematical function
>>>
>>>
>>> #### Note: I am also getting an error after using topTags about passing a
>>> non-numeric argument to a mathermatical function. I noticed there was a
>>> previous post about similarly passing a single contrast to a glmLRT and
>>> retrieving the same error, which was corrected in edgeR_3.0.7. As I am
>>> using 3.0.8, I am assuming the error has more to do with how I have set
>>> up/called my contrasts, but thought I would mention it.
>>>
>>> sessionInfo()
>> R version 2.15.2 (2012-10-26)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] splines stats graphics grDevices utils datasets methods
>> base
>>
>> other attached packages:
>> [1] edgeR_3.0.8 limma_3.14.4
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.15.2
>>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list