[BioC] edgeR: design matrix for different condition
KJ Lim
jinkeanlim at gmail.com
Tue Aug 7 14:30:18 CEST 2012
Dear Prof Gordon,
Thanks for the correction, you are right about the genotype label. I
managed to carry out the estimate dispersion for my data.
Thank you so much for your time and help.
Best regards,
KJ Lim
On 6 August 2012 10:32, KJ Lim <jinkeanlim at gmail.com> wrote:
> Dear Prof Gordon,
>
> Thank you very much for your prompt replied.
>
> You are right about the genotype, I have two genotypes for my
> experiment. In this case, the mistake I made was label them
> differently. I will correct them and continue with the estimate
> dispersion as soon as I back to office.
>
> Thanks a lot for your advice, your time as well as your help. I
> appreciate that very much.
>
> Have a nice day.
>
> Best regards,
> KJ Lim
>
>
> On 6 August 2012 10:11, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>> Dear KJ Lim,
>>
>> Well, your "tree" column has 16 different entries, a different entry for
>> every library. Naturally this causes a problem.
>>
>> So either you have no replication of any experimental condition, or you have
>> mistakingly used different labels for the same condition.
>>
>> In your earlier email below, you claimed to have just two genotypes (tree=HS
>> and tree=LS), but in your latest targets file you have 16 different
>> genotypes. I'm guessing that the genotypes are actually H and L, so the
>> entries in the targets file should be just H and L.
>>
>> That's as much advice as I have time to give you, and I hope that others
>> will help you further. I feel that I've given you good general advice that
>> is applicable to your experiment.
>>
>>
>> Best wishes
>> Gordon
>>
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> Tel: (03) 9345 2326, Fax (03) 9347 0852,
>> http://www.statsci.org/smyth
>>
>> On Mon, 6 Aug 2012, KJ Lim wrote:
>>
>>> Dear Prof Gordon,
>>>
>>> Good day. Thanks for you prompt replied.
>>>
>>> Below is the output of cbind(targets,Group=Group)
>>>
>>>> cbind(targets,Group=Group)
>>>
>>> files treat tree X Group
>>> 1 ./rawData/HS0H01.txt 00H H1 NA 00H.H1
>>> 2 ./rawData/HS0H02.txt 00H H2 NA 00H.H2
>>> 3 ./rawData/HS3H01.txt 03H H3 NA 03H.H3
>>> 4 ./rawData/HS3H02.txt 03H H4 NA 03H.H4
>>> 5 ./rawData/HS1D01.txt 24H H5 NA 24H.H5
>>> 6 ./rawData/HS1D02.txt 24H H6 NA 24H.H6
>>> 7 ./rawData/HS4D01.txt 96H H7 NA 96H.H7
>>> 8 ./rawData/HS4D02.txt 96H H8 NA 96H.H8
>>> 9 ./rawData/LS0H01.txt 00H L1 NA 00H.L1
>>> 10 ./rawData/LS0H02.txt 00H L2 NA 00H.L2
>>> 11 ./rawData/LS3H01.txt 03H L3 NA 03H.L3
>>> 12 ./rawData/LS3H02.txt 03H L4 NA 03H.L4
>>> 13 ./rawData/LS1D01.txt 24H L5 NA 24H.L5
>>> 14 ./rawData/LS1D02.txt 24H L6 NA 24H.L6
>>> 15 ./rawData/LS4D01.txt 96H L7 NA 96H.L7
>>> 16 ./rawData/LS4D02.txt 96H L8 NA 96H.L8
>>>
>>> Thank you for your time.
>>>
>>> Best regards,
>>> KJ Lim
>>>
>>>
>>> On 6 August 2012 07:33, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>
>>>> The error message is pretty self-explanatory. Apparently your design
>>>> matrix
>>>> has as many columns as there are libraries, so there are no degrees of
>>>> freedom left from which to estimate variability. According to the code
>>>> and
>>>> output you have given, this message should be impossible, so you must
>>>> have
>>>> changed the data in some way that I cannot see.
>>>>
>>>> If you had shown the output from cbind(targets,Group=Group), then I might
>>>> have been able to say something useful.
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>>
>>>> On Mon, 6 Aug 2012, KJ Lim wrote:
>>>>
>>>>> Dear Prof Gordon,
>>>>>
>>>>> Good day.
>>>>>
>>>>> Thanks for your time to update the edgeR User's Guide. It is useful.
>>>>>
>>>>> I combined all my experiment factors into one combined factor and
>>>>> described the experiment matrix like the example in the User's guide:
>>>>>
>>>>> > Group <- factor(paste(targets$treat, targets$tree,sep="."))
>>>>> > cbind(targets,Group=Group)
>>>>> > hl.design <- model.matrix(~0+Group)
>>>>>
>>>>> But, I encountered an error when I performed the estimate dispersion
>>>>> for my data
>>>>>
>>>>> > hl <- estimateGLMCommonDisp(hl, hl.design)
>>>>> Warning message:
>>>>> In estimateGLMCommonDisp.default(y = y$counts, design = design, :
>>>>> No residual df: setting dispersion to NA
>>>>>
>>>>> I tried to figure out what I have done wrong, unfortunately, I have no
>>>>> luck on that. Could you or the community kindly please light me for
>>>>> this matter?
>>>>>
>>>>> Thank you very much for your time.
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>>
>>>>> R version 2.15.1 (2012-06-22)
>>>>> Platform: i486-pc-linux-gnu (32-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
>>>>> [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>
>>>>> other attached packages:
>>>>> [1] edgeR_2.6.10 limma_3.12.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tcltk_2.15.1 tools_2.15.1
>>>>>
>>>>> Best regards,
>>>>> KJ Lim
>>>>>
>>>>>
>>>>>
>>>>> On 31 July 2012 02:24, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>>
>>>>>>
>>>>>> Dear KH Kim,
>>>>>>
>>>>>> No, you have not understood me correctly. I did not suggest that you
>>>>>> change
>>>>>> from edgeR to limma. I suggested that you read the limma documentation
>>>>>> because the design matrix is the same for edgeR as it is for limma, so
>>>>>> the
>>>>>> limma documentation would help you create the design matrix for your
>>>>>> edgeR
>>>>>> analysis.
>>>>>>
>>>>>> I thought from your last email that you had already done this and that
>>>>>> you
>>>>>> had completed the edgeR analysis satisfactorily.
>>>>>>
>>>>>> I wrote more documentation for the edgeR User's Guide a couple of days
>>>>>> ago,
>>>>>> trying to give advice for experiments such as yours. You might find
>>>>>> that
>>>>>> Section 3.3 of
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
>>>>>>
>>>>>> gives more explanation.
>>>>>>
>>>>>> Best wishes
>>>>>> Gordon
>>>>>>
>>>>>> ---------------------------------------------
>>>>>> Professor Gordon K Smyth,
>>>>>> Bioinformatics Division,
>>>>>> Walter and Eliza Hall Institute of Medical Research,
>>>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>>> Tel: (03) 9345 2326, Fax (03) 9347 0852,
>>>>>> http://www.statsci.org/smyth
>>>>>>
>>>>>>
>>>>>> On Mon, 30 Jul 2012, KJ Lim wrote:
>>>>>>
>>>>>>> Dear Prof Gordon,
>>>>>>>
>>>>>>> Good day.
>>>>>>>
>>>>>>> I have read the section 8.5 of the Limma manual as you suggested in
>>>>>>> previous emails. Thanks.
>>>>>>>
>>>>>>> If I understand you correctly, you would suggest me to carry out my DE
>>>>>>> analysis with limma package if I would like to learn the which genes
>>>>>>> are express in treeHS compare to treeLS (vice versa) at i.e. 24H.
>>>>>>>
>>>>>>> May I ask how can I generate the EList, "eset", in order to fit in the
>>>>>>> linear model as mentioned in the limma manual
>>>>>>>
>>>>>>> fit <- lmFit(eset, design)
>>>>>>> fit <- eBayes(fit)
>>>>>>>
>>>>>>> Please correct me if I'm wrong.
>>>>>>>
>>>>>>> Thank you very much for your time and help.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> KJ Lim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 27 July 2012 02:31, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Dear KJ Lim,
>>>>>>>>
>>>>>>>> Thanks for the rephrasing, which is clearer. I would have liked you
>>>>>>>> to
>>>>>>>> mention however whether you read the documentation that I refered you
>>>>>>>> do.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 26 Jul 2012, KJ Lim wrote:
>>>>>>>>
>>>>>>>>> Dear Prof Gordon,
>>>>>>>>>
>>>>>>>>> Good day. Thanks for your prompt replied.
>>>>>>>>>
>>>>>>>>> Please allow me to rephrase my previous question.
>>>>>>>>>
>>>>>>>>> This model, ~tree+treat, is assumed effect of the time of
>>>>>>>>> treatment
>>>>>>>>> will be the same irrespective of genotype.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, this model makes that assumption.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thus, test for the coef=3 will gives me the differential expression
>>>>>>>>> at
>>>>>>>>> 96H
>>>>>>>>> irrespective of genotype.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually coef=3 refers to 24H, according to the design matrix in your
>>>>>>>> original email. A test for coef=3 would test for DE at 24H vs 0H,
>>>>>>>> irrespective of genotype. But only if the assumption mentioned above
>>>>>>>> is
>>>>>>>> true, which is unlikely given the rest of your email.
>>>>>>>>
>>>>>>>>
>>>>>>>>> I would like to learn the differential expression of treeHS vs
>>>>>>>>> treeLS
>>>>>>>>> at specific time points i.e. 24H or if possible across the whole
>>>>>>>>> time
>>>>>>>>> points. Should I fit my model as
>>>>>>>>>
>>>>>>>>> model A: ~tree*treat OR model B: ~tree+tree:treat ?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> These models are equivalent, so it is just a matter of convenience
>>>>>>>> which
>>>>>>>> one
>>>>>>>> you use, as I tried to explain in the limma documentation I refered
>>>>>>>> you
>>>>>>>> to.
>>>>>>>>
>>>>>>>>
>>>>>>>>> The design matrix columns of model B give:
>>>>>>>>> "(Intercept)" "treeLS" "treeHS:treat03H" "treeLS:treat03H"
>>>>>>>>> "treeHS:treat24H" "treeLS:treat24H" "treeHS:treat96H"
>>>>>>>>> "treeLS:treat96H"
>>>>>>>>>
>>>>>>>>> Am I doing right if I fit the model B and test for the coef=3:4 to
>>>>>>>>> learn the differential expression of treeHS vs treeLS at specific
>>>>>>>>> time
>>>>>>>>> points i.e. 03H? Does this test could tells me which set of genes
>>>>>>>>> up/down-regulated in treeLS or treeHS?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes. Coef 3 tests for a 3H effect in HS. Coef 4 tests for a 3H in
>>>>>>>> LS.
>>>>>>>> So
>>>>>>>> testing for both coefficients coef=3:4 tests for a 3H effect in
>>>>>>>> either
>>>>>>>> treeLS or treeHS.
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Gordon
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'm not good in statistic and I'm learning, kindly please correct
>>>>>>>>> me
>>>>>>>>> if I'm wrong.
>>>>>>>>>
>>>>>>>>> Thank you very much for your time and suggestion.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> KJ Lim
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26 July 2012 03:39, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Dear KJ Lim,
>>>>>>>>>>
>>>>>>>>>> I don't quite understand your question, because you seem to be
>>>>>>>>>> asking
>>>>>>>>>> for
>>>>>>>>>> something that isn't a test for differential expression, which is
>>>>>>>>>> what
>>>>>>>>>> edgeR
>>>>>>>>>> does. You question "I would like to learn the genes are express in
>>>>>>>>>> treeHS
>>>>>>>>>> but not treeLS and vice versa" seems to be about absolute
>>>>>>>>>> expression
>>>>>>>>>> levels.
>>>>>>>>>> edgeR doesn't test for genes that are not expressed in particular
>>>>>>>>>> conditions.
>>>>>>>>>>
>>>>>>>>>> I'll refer you to the limma section on interaction models in case
>>>>>>>>>> that
>>>>>>>>>> helps, see Section 8.5 of:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://bioconductor.org/packages/2.11/bioc/vignettes/limma/inst/doc/usersguide.pdf
>>>>>>>>>>
>>>>>>>>>> Setting up a design matrix is the same for edgeR as it is for
>>>>>>>>>> limma.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>> Gordon
>>>>>>>>>>
>>>>>>>>>>> Date: Tue, 24 Jul 2012 17:12:00 +0300
>>>>>>>>>>> From: KJ Lim <jinkeanlim at gmail.com>
>>>>>>>>>>> To: Bioconductor mailing list <bioconductor at r-project.org>
>>>>>>>>>>> Subject: [BioC] edgeR: design matrix for different condition
>>>>>>>>>>>
>>>>>>>>>>> Dear the edgeR community,
>>>>>>>>>>>
>>>>>>>>>>> Good day.
>>>>>>>>>>>
>>>>>>>>>>> I'm analyzing my RNA-Seq experiment with edgeR. My study has 2
>>>>>>>>>>> different genotypes (treeHS, treeLS) and time of treatment (0H,
>>>>>>>>>>> 3H,
>>>>>>>>>>> 24H, 96H).
>>>>>>>>>>>
>>>>>>>>>>> I first assumed that the treatment effect will be the same for
>>>>>>>>>>> each
>>>>>>>>>>> genotype and I have the design matrix as:
>>>>>>>>>>>
>>>>>>>>>>> design <- model.matrix(~tree+treat)
>>>>>>>>>>>
>>>>>>>>>>>> design
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (Intercept) treat03H treat24H treat96H treeLS
>>>>>>>>>>> 1 1 0 0 0 0
>>>>>>>>>>> 2 1 0 0 0 0
>>>>>>>>>>> 3 1 1 0 0 0
>>>>>>>>>>> 4 1 1 0 0 0
>>>>>>>>>>> 5 1 0 1 0 0
>>>>>>>>>>> 6 1 0 1 0 0
>>>>>>>>>>> 7 1 0 0 1 0
>>>>>>>>>>> 8 1 0 0 1 0
>>>>>>>>>>> 9 1 0 0 0 1
>>>>>>>>>>> 10 1 0 0 0 1
>>>>>>>>>>> 11 1 1 0 0 1
>>>>>>>>>>> 12 1 1 0 0 1
>>>>>>>>>>> 13 1 0 1 0 1
>>>>>>>>>>> 14 1 0 1 0 1
>>>>>>>>>>> 15 1 0 0 1 1
>>>>>>>>>>> 16 1 0 0 1 1
>>>>>>>>>>>
>>>>>>>>>>> I used coef=4 to test for the differential expressions between
>>>>>>>>>>> treeHS
>>>>>>>>>>> and treeLS within the time of treatment, coef=3 to learn the
>>>>>>>>>>> differential expressions in 2 genotypes at time of treatment 96H.
>>>>>>>>>>>
>>>>>>>>>>> ** I would like to learn the genes are express in treeHS but not
>>>>>>>>>>> treeLS and vice versa at certain time of treatment let's say 24H
>>>>>>>>>>> or
>>>>>>>>>>> across the whole time of treatment, should I have the design
>>>>>>>>>>> matrix
>>>>>>>>>>> as
>>>>>>>>>>> below or more steps I need to do?
>>>>>>>>>>>
>>>>>>>>>>> design <- model.matrix(~tree*treat)
>>>>>>>>>>>
>>>>>>>>>>> Kindly please light me on this. I appreciate very much for your
>>>>>>>>>>> help
>>>>>>>>>>> and
>>>>>>>>>>> time.
>>>>>>>>>>>
>>>>>>>>>>> Have a nice day.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> KJ Lim
>>
>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
More information about the Bioconductor
mailing list