[BioC] edgeR: design matrix for different condition

Mon Aug 6 09:32:02 CEST 2012

Dear Prof Gordon,

Thank you very much for your prompt replied.

You are right about the genotype, I have two genotypes for my
experiment. In this case, the mistake I made was label them
differently. I will correct them and continue with the estimate
dispersion as soon as I back to office.

Thanks a lot for your advice, your time as well as your help. I
appreciate that very much.

Have a nice day.

Best regards,
KJ Lim

On 6 August 2012 10:11, Gordon K Smyth <smyth at wehi.edu.au> wrote:
> Dear KJ Lim,
>
> Well, your "tree" column has 16 different entries, a different entry for
> every library.  Naturally this causes a problem.
>
> So either you have no replication of any experimental condition, or you have
> mistakingly used different labels for the same condition.
>
> In your earlier email below, you claimed to have just two genotypes (tree=HS
> and tree=LS), but in your latest targets file you have 16 different
> genotypes.  I'm guessing that the genotypes are actually H and L, so the
> entries in the targets file should be just H and L.
>
> That's as much advice as I have time to give you, and I hope that others
> will help you further.  I feel that I've given you good general advice that
> is applicable to your experiment.
>
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Tel: (03) 9345 2326, Fax (03) 9347 0852,
> http://www.statsci.org/smyth
>
> On Mon, 6 Aug 2012, KJ Lim wrote:
>
>> Dear Prof Gordon,
>>
>> Good day. Thanks for you prompt replied.
>>
>> Below is the output of cbind(targets,Group=Group)
>>
>>> cbind(targets,Group=Group)
>>
>>                  files treat tree  X  Group
>> 1  ./rawData/HS0H01.txt   00H   H1 NA 00H.H1
>> 2  ./rawData/HS0H02.txt   00H   H2 NA 00H.H2
>> 3  ./rawData/HS3H01.txt   03H   H3 NA 03H.H3
>> 4  ./rawData/HS3H02.txt   03H   H4 NA 03H.H4
>> 5  ./rawData/HS1D01.txt   24H   H5 NA 24H.H5
>> 6  ./rawData/HS1D02.txt   24H   H6 NA 24H.H6
>> 7  ./rawData/HS4D01.txt   96H   H7 NA 96H.H7
>> 8  ./rawData/HS4D02.txt   96H   H8 NA 96H.H8
>> 9  ./rawData/LS0H01.txt   00H   L1 NA 00H.L1
>> 10 ./rawData/LS0H02.txt   00H   L2 NA 00H.L2
>> 11 ./rawData/LS3H01.txt   03H   L3 NA 03H.L3
>> 12 ./rawData/LS3H02.txt   03H   L4 NA 03H.L4
>> 13 ./rawData/LS1D01.txt   24H   L5 NA 24H.L5
>> 14 ./rawData/LS1D02.txt   24H   L6 NA 24H.L6
>> 15 ./rawData/LS4D01.txt   96H   L7 NA 96H.L7
>> 16 ./rawData/LS4D02.txt   96H   L8 NA 96H.L8
>>
>> Thank you for your time.
>>
>> Best regards,
>> KJ Lim
>>
>>
>> On 6 August 2012 07:33, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>
>>> The error message is pretty self-explanatory.  Apparently your design
>>> matrix
>>> has as many columns as there are libraries, so there are no degrees of
>>> freedom left from which to estimate variability.  According to the code
>>> and
>>> output you have given, this message should be impossible, so you must
>>> have
>>> changed the data in some way that I cannot see.
>>>
>>> If you had shown the output from cbind(targets,Group=Group), then I might
>>> have been able to say something useful.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>
>>> On Mon, 6 Aug 2012, KJ Lim wrote:
>>>
>>>> Dear Prof Gordon,
>>>>
>>>> Good day.
>>>>
>>>> Thanks for your time to update the edgeR User's Guide. It is useful.
>>>>
>>>> I combined all my experiment factors into one combined factor and
>>>> described the experiment matrix like the example in the User's guide:
>>>>
>>>> > Group <- factor(paste(targets$treat, targets$tree,sep="."))
>>>> > cbind(targets,Group=Group)
>>>> > hl.design <- model.matrix(~0+Group)
>>>>
>>>> But, I encountered an error when I performed the estimate dispersion
>>>> for my data
>>>>
>>>> > hl <- estimateGLMCommonDisp(hl, hl.design)
>>>>  Warning message:
>>>>  In estimateGLMCommonDisp.default(y = y$counts, design = design,  :
>>>>    No residual df: setting dispersion to NA
>>>>
>>>> I tried to figure out what I have done wrong, unfortunately, I have no
>>>> luck on that. Could you or the community kindly please light me for
>>>> this matter?
>>>>
>>>> Thank you very much for your time.
>>>>
>>>>> sessionInfo()
>>>>
>>>>
>>>> R version 2.15.1 (2012-06-22)
>>>> Platform: i486-pc-linux-gnu (32-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
>>>> [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
>>>> [7] LC_PAPER=C                LC_NAME=C
>>>> [9] LC_ADDRESS=C              LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] edgeR_2.6.10 limma_3.12.1
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] tcltk_2.15.1 tools_2.15.1
>>>>
>>>> Best regards,
>>>> KJ Lim
>>>>
>>>>
>>>>
>>>> On 31 July 2012 02:24, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>
>>>>>
>>>>> Dear KH Kim,
>>>>>
>>>>> No, you have not understood me correctly.  I did not suggest that you
>>>>> change
>>>>> from edgeR to limma.  I suggested that you read the limma documentation
>>>>> because the design matrix is the same for edgeR as it is for limma, so
>>>>> the
>>>>> limma documentation would help you create the design matrix for your
>>>>> edgeR
>>>>> analysis.
>>>>>
>>>>> I thought from your last email that you had already done this and that
>>>>> you
>>>>> had completed the edgeR analysis satisfactorily.
>>>>>
>>>>> I wrote more documentation for the edgeR User's Guide a couple of days
>>>>> ago,
>>>>> trying to give advice for experiments such as yours.  You might find
>>>>> that
>>>>> Section 3.3 of
>>>>>
>>>>>
>>>>>
>>>>> http://bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
>>>>>
>>>>> gives more explanation.
>>>>>
>>>>> Best wishes
>>>>> Gordon
>>>>>
>>>>> ---------------------------------------------
>>>>> Professor Gordon K Smyth,
>>>>> Bioinformatics Division,
>>>>> Walter and Eliza Hall Institute of Medical Research,
>>>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>>>> Tel: (03) 9345 2326, Fax (03) 9347 0852,
>>>>> http://www.statsci.org/smyth
>>>>>
>>>>>
>>>>> On Mon, 30 Jul 2012, KJ Lim wrote:
>>>>>
>>>>>> Dear Prof Gordon,
>>>>>>
>>>>>> Good day.
>>>>>>
>>>>>> I have read the section 8.5 of the Limma manual as you suggested in
>>>>>> previous emails. Thanks.
>>>>>>
>>>>>> If I understand you correctly, you would suggest me to carry out my DE
>>>>>> analysis with limma package if I would like to learn the which genes
>>>>>> are express in treeHS compare to treeLS (vice versa) at i.e. 24H.
>>>>>>
>>>>>> May I ask how can I generate the EList, "eset", in order to fit in the
>>>>>> linear model as mentioned in the limma manual
>>>>>>
>>>>>>  fit <- lmFit(eset, design)
>>>>>>  fit <- eBayes(fit)
>>>>>>
>>>>>> Please correct me if I'm wrong.
>>>>>>
>>>>>> Thank you very much for your time and help.
>>>>>>
>>>>>> Best regards,
>>>>>> KJ Lim
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 27 July 2012 02:31, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dear KJ Lim,
>>>>>>>
>>>>>>> Thanks for the rephrasing, which is clearer.  I would have liked you
>>>>>>> to
>>>>>>> mention however whether you read the documentation that I refered you
>>>>>>> do.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 26 Jul 2012, KJ Lim wrote:
>>>>>>>
>>>>>>>> Dear Prof Gordon,
>>>>>>>>
>>>>>>>> Good day. Thanks for your prompt replied.
>>>>>>>>
>>>>>>>> Please allow me to rephrase my previous question.
>>>>>>>>
>>>>>>>> This model,  ~tree+treat,  is assumed effect of the time of
>>>>>>>> treatment
>>>>>>>> will be the same irrespective of genotype.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Yes, this model makes that assumption.
>>>>>>>
>>>>>>>
>>>>>>>> Thus, test for the coef=3 will gives me the differential expression
>>>>>>>> at
>>>>>>>> 96H
>>>>>>>> irrespective of genotype.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Actually coef=3 refers to 24H, according to the design matrix in your
>>>>>>> original email.  A test for coef=3 would test for DE at 24H vs 0H,
>>>>>>> irrespective of genotype.  But only if the assumption mentioned above
>>>>>>> is
>>>>>>> true, which is unlikely given the rest of your email.
>>>>>>>
>>>>>>>
>>>>>>>> I would like to learn the differential expression of treeHS vs
>>>>>>>> treeLS
>>>>>>>> at specific time points i.e. 24H or if possible across the whole
>>>>>>>> time
>>>>>>>> points. Should I fit my model as
>>>>>>>>
>>>>>>>>  model A: ~tree*treat    OR   model B: ~tree+tree:treat ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> These models are equivalent, so it is just a matter of convenience
>>>>>>> which
>>>>>>> one
>>>>>>> you use, as I tried to explain in the limma documentation I refered
>>>>>>> you
>>>>>>> to.
>>>>>>>
>>>>>>>
>>>>>>>> The design matrix columns of model B give:
>>>>>>>>  "(Intercept)"  "treeLS"  "treeHS:treat03H"  "treeLS:treat03H"
>>>>>>>> "treeHS:treat24H"  "treeLS:treat24H"  "treeHS:treat96H"
>>>>>>>> "treeLS:treat96H"
>>>>>>>>
>>>>>>>> Am I doing right if I fit the model B and test for the coef=3:4 to
>>>>>>>> learn the differential expression of treeHS vs treeLS at specific
>>>>>>>> time
>>>>>>>> points i.e. 03H? Does this test could tells me which set of genes
>>>>>>>> up/down-regulated in treeLS or treeHS?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Yes.  Coef 3 tests for a 3H effect in HS.  Coef 4 tests for a 3H in
>>>>>>> LS.
>>>>>>> So
>>>>>>> testing for both coefficients coef=3:4 tests for a 3H effect in
>>>>>>> either
>>>>>>> treeLS or treeHS.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Gordon
>>>>>>>
>>>>>>>
>>>>>>>> I'm not good in statistic and I'm learning,  kindly please correct
>>>>>>>> me
>>>>>>>> if I'm wrong.
>>>>>>>>
>>>>>>>> Thank you very much for your time and suggestion.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> KJ Lim
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26 July 2012 03:39, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear KJ Lim,
>>>>>>>>>
>>>>>>>>> I don't quite understand your question, because you seem to be
>>>>>>>>> asking
>>>>>>>>> for
>>>>>>>>> something that isn't a test for differential expression, which is
>>>>>>>>> what
>>>>>>>>> edgeR
>>>>>>>>> does.  You question "I would like to learn the genes are express in
>>>>>>>>> treeHS
>>>>>>>>> but not treeLS and vice versa" seems to be about absolute
>>>>>>>>> expression
>>>>>>>>> levels.
>>>>>>>>> edgeR doesn't test for genes that are not expressed in particular
>>>>>>>>> conditions.
>>>>>>>>>
>>>>>>>>> I'll refer you to the limma section on interaction models in case
>>>>>>>>> that
>>>>>>>>> helps, see Section 8.5 of:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://bioconductor.org/packages/2.11/bioc/vignettes/limma/inst/doc/usersguide.pdf
>>>>>>>>>
>>>>>>>>> Setting up a design matrix is the same for edgeR as it is for
>>>>>>>>> limma.
>>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Gordon
>>>>>>>>>
>>>>>>>>>> Date: Tue, 24 Jul 2012 17:12:00 +0300
>>>>>>>>>> From: KJ Lim <jinkeanlim at gmail.com>
>>>>>>>>>> To: Bioconductor mailing list <bioconductor at r-project.org>
>>>>>>>>>> Subject: [BioC] edgeR: design matrix for different condition
>>>>>>>>>>
>>>>>>>>>> Dear the edgeR community,
>>>>>>>>>>
>>>>>>>>>> Good day.
>>>>>>>>>>
>>>>>>>>>> I'm analyzing my RNA-Seq experiment with edgeR. My study has 2
>>>>>>>>>> different genotypes (treeHS, treeLS) and time of treatment (0H,
>>>>>>>>>> 3H,
>>>>>>>>>> 24H, 96H).
>>>>>>>>>>
>>>>>>>>>> I first assumed that the treatment effect will be the same for
>>>>>>>>>> each
>>>>>>>>>> genotype and I have the design matrix as:
>>>>>>>>>>
>>>>>>>>>>  design <- model.matrix(~tree+treat)
>>>>>>>>>>
>>>>>>>>>>> design
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   (Intercept)  treat03H  treat24H  treat96H  treeLS
>>>>>>>>>> 1             1        0        0        0      0
>>>>>>>>>> 2             1        0        0        0      0
>>>>>>>>>> 3             1        1        0        0      0
>>>>>>>>>> 4             1        1        0        0      0
>>>>>>>>>> 5             1        0        1        0      0
>>>>>>>>>> 6             1        0        1        0      0
>>>>>>>>>> 7             1        0        0        1      0
>>>>>>>>>> 8             1        0        0        1      0
>>>>>>>>>> 9             1        0        0        0      1
>>>>>>>>>> 10           1        0        0        0      1
>>>>>>>>>> 11           1        1        0        0      1
>>>>>>>>>> 12           1        1        0        0      1
>>>>>>>>>> 13           1        0        1        0      1
>>>>>>>>>> 14           1        0        1        0      1
>>>>>>>>>> 15           1        0        0        1      1
>>>>>>>>>> 16           1        0        0        1      1
>>>>>>>>>>
>>>>>>>>>> I used coef=4 to test for the differential expressions between
>>>>>>>>>> treeHS
>>>>>>>>>> and treeLS within the time of treatment, coef=3 to learn the
>>>>>>>>>> differential expressions in 2 genotypes at time of treatment 96H.
>>>>>>>>>>
>>>>>>>>>> ** I would like to learn the genes are express in treeHS but not
>>>>>>>>>> treeLS and vice versa at certain time of treatment let's say 24H
>>>>>>>>>> or
>>>>>>>>>> across the whole time of treatment, should I have the design
>>>>>>>>>> matrix
>>>>>>>>>> as
>>>>>>>>>> below or more steps I need to do?
>>>>>>>>>>
>>>>>>>>>>  design <- model.matrix(~tree*treat)
>>>>>>>>>>
>>>>>>>>>> Kindly please light me on this. I appreciate very much for your
>>>>>>>>>> help
>>>>>>>>>> and
>>>>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> Have a nice day.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> KJ Lim
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}