[BioC] Repeated Measures mRNA expression analysis
Gordon K Smyth
smyth at wehi.EDU.AU
Wed Jul 3 09:12:28 CEST 2013
On Tue, 2 Jul 2013, Charles Determan Jr wrote:
> I made an error in my last response, my subject was not set as a factor.
> The design matrix looks like the matrix below. Perhaps a shorter
> question that will satisfy me is what I do with all the 'subject'
> coefficients. If I am only interested in the genes that respond
> differently to time in treated vs. control subjects do I simply ignore
> the subject coefficients?
Yes, you ignore them. They take care of the subject baseline effects, but
these are not of interest to you.
> Lastly, am I correct that to determine genes that respond differently to
> time in treated vs. control subjects I simply conduct the contrasts
> between the last coefficients (i.e. 40-39 and 38-37).
Yes.
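As a rough sketch of those two contrasts (not code from this thread): the positions 37 to 40 are read off the colnames(design) output quoted below, and `fit` in the commented lines is a hypothetical result of edgeR's glmFit().

```r
# Number of coefficients, taken from the colnames(design) output in this thread.
n_coef <- 40

# Treatment minus Control response at time2 (coefficient 38 minus 37).
con_time2 <- numeric(n_coef)
con_time2[c(37, 38)] <- c(-1, 1)

# Treatment minus Control response at time3 (coefficient 40 minus 39).
con_time3 <- numeric(n_coef)
con_time3[c(39, 40)] <- c(-1, 1)

# Both contrasts side by side, one column per contrast.
contrasts_both <- cbind(time2 = con_time2, time3 = con_time3)

# Hypothetical edgeR usage, assuming fit <- glmFit(y, design):
# lrt_time2 <- glmLRT(fit, contrast = con_time2)
# lrt_time3 <- glmLRT(fit, contrast = con_time3)
```

Each vector has a single +1 and a single -1, so every subject coefficient gets weight zero and drops out of the test.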
> Apologies for turning this into such a long post, I hope it is helpful
> for others as well.
To find genes DE at time2 in the controls:
coef = "group.Control:times.time2"
To find genes DE at time2 in the treated subjects:
coef = "group.Treatment:times.time2"
To find genes DE at time3 in the controls:
coef = "group.Control:times.time3"
To find genes DE at time3 in the treated subjects:
coef = "group.Treatment:times.time3"
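A minimal sketch of where coefficient names of this kind come from, assuming a toy layout of 3 subjects per group and 3 time points, with subjects renumbered within each group as in Section 3.5 of the edgeR User's Guide. The data frame `targets` and its column names are illustrative only, so the resulting names (for example "groupControl:time2") differ from the ones above, and `fit` in the commented line is a hypothetical glmFit() result.

```r
# Toy layout (an assumption): 2 groups x 3 subjects x 3 time points = 18 samples.
targets <- expand.grid(
  time    = factor(1:3),
  subject = factor(1:3),                      # numbered within each group
  group   = factor(c("Control", "Treatment"))
)

# Group effect, subject baselines nested within group, and a separate
# time effect for each group.
design <- model.matrix(~ group + group:subject + group:time, data = targets)
print(colnames(design))

# The group-specific time effects are the columns whose names contain ":time".
time_coefs <- grep(":time", colnames(design), value = TRUE)
print(time_coefs)

# Hypothetical edgeR usage, assuming fit <- glmFit(y, design):
# lrt <- glmLRT(fit, coef = "groupControl:time2")
```

Selecting coefficients by name rather than by position is less error-prone when the number of subject columns changes.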
Best wishes
Gordon
> Thanks, Charles
>
>> colnames(design)
> [1] "(Intercept)" "group.Treatment"
> [3] "group.Control:subject.2" "group.Treatment:subject.2"
> [5] "group.Control:subject.3" "group.Treatment:subject.3"
> [7] "group.Control:subject.4" "group.Treatment:subject.4"
> [9] "group.Control:subject.5" "group.Treatment:subject.5"
> [11] "group.Control:subject.6" "group.Treatment:subject.6"
> [13] "group.Control:subject.7" "group.Treatment:subject.7"
> [15] "group.Control:subject.8" "group.Treatment:subject.8"
> [17] "group.Control:subject.9" "group.Treatment:subject.9"
> [19] "group.Control:subject.10" "group.Treatment:subject.10"
> [21] "group.Control:subject.11" "group.Treatment:subject.11"
> [23] "group.Control:subject.12" "group.Treatment:subject.12"
> [25] "group.Control:subject.13" "group.Treatment:subject.13"
> [27] "group.Control:subject.14" "group.Treatment:subject.14"
> [29] "group.Control:subject.15" "group.Treatment:subject.15"
> [31] "group.Control:subject.16" "group.Treatment:subject.16"
> [33] "group.Control:subject.17" "group.Treatment:subject.17"
> [35] "group.Control:subject.18" "group.Treatment:subject.18"
> [37] "group.Control:times.time2" "group.Treatment:times.time2"
> [39] "group.Control:times.time3" "group.Treatment:times.time3"
>
>
>
>
>
> On Tue, Jul 2, 2013 at 10:50 AM, Charles Determan Jr <deter088 at umn.edu> wrote:
>
>> Thank you Gordon,
>> I apologize for not being more straightforward with a specific question.
>> I continue to be reminded of my ignorance in approaching statistical
>> problems but my education continues. I had previously been under the
>> impression that different experimental designs have specific methods that
>> are most appropriate. I previously read section 3.5 in the edgeR guide but
>> glossed over it because it didn't have time points explicitly included. I
>> feel a little silly that the idea of between and within subjects escaped me
>> but that should serve my purposes. If you will indulge one further
>> question concerning that very example.
>>
>> The design matrix I generate looks like this:
>>
>>> colnames(design)
>> [1] "(Intercept)" "group.Treatment"
>> [3] "group.control:subject" "group.Treatment:subject"
>> [5] "group.control:times.time2" "group.Treatment:times.time2"
>> [7] "group.control:times.time3" "group.Treatment:times.time3"
>>
>>
>> You said "The analysis given in the edgeR user's guide allows you to find
>> genes that are different over time for (i) treated subjects and (ii)
>> control subjects, and it allows you to find genes that respond differently
>> to time in the treated vs control subject." If I am not mistaken,
>> coefficients 5-8 correspond to point (i) and (ii). However, I don't see
>> how I can determine which genes respond differently to time in the treated
>> vs. control subject. I apologize if I seem obtuse but these interactions
>> have always been difficult for me to conceptualize. Any explanation or
>> direction so that I may understand these interactions related to your
>> points would be sincerely appreciated.
>>
>> Best regards,
>> Charles
>>
>>
>> On Tue, Jul 2, 2013 at 3:31 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>
>>> Hi Charles,
>>>
>>> Yes, you're on the right track now, but this is not a simple design and
>>> it requires care. As James says, it depends on what assumptions you want
>>> to make. I would add that it also depends on what questions you want to
>>> answer. In my previous two posts, I tried to prompt you to state what
>>> questions you want to answer, but you haven't taken the bait yet. A
>>> statistical analysis is always designed to test certain scientific
>>> questions -- there isn't a "correct" analysis for a given design
>>> independent of what your hypotheses are.
>>>
>>> Have you looked at Section 3.5 "Comparisons Both Between and Within
>>> Subjects" in the edgeR User's Guide? The design discussed in this section
>>> is the same as your experiment, except that you have 3 repeated measures
>>> per subject instead of 2.
>>>
>>> The analysis given in the edgeR user's guide allows you to find genes
>>> that are different over time for (i) treated subjects and (ii) control
>>> subjects, and it allows you to find genes that respond differently to time
>>> in the treated vs control subject.
>>>
>>> However it does not allow you to test for a baseline difference between
>>> treated and control subjects at time 0. If you need to do this, then a
>>> quite different analysis is needed (discussed in Section 9.7 "Multi-level
>>> Experiments" of the limma User's Guide).
>>>
>>> Best wishes
>>> Gordon
>>>
>>>
>>> On Mon, 1 Jul 2013, James W. MacDonald wrote:
>>>
>>> Hi Charles,
>>>>
>>>> On 7/1/2013 9:07 AM, Charles Determan Jr wrote:
>>>>
>>>
>>>>> I apologize for a second post but I want to bring this question back up
>>>>> as I still cannot find a definitive answer on my own. In brief, I am
>>>>> wondering about the design matrix when testing for differential expression
>>>>> between two groups within which each sample has been measured at
>>>>> consecutive timepoints (repeated measures). Therefore, if my
>>>>> interpretations are correct, I need a two-way analysis that recognizes
>>>>> dependence between consecutive measurements. I am familiar with limma,
>>>>> edgeR and DESeq but am uncertain how to design an appropriate design matrix
>>>>> for these comparisons. The best I can guess is that I add a 'Subject'
>>>>> factor to the design matrix corresponding to each unique sample to correct
>>>>> for dependence, is this correct?
>>>>>
>>>>
>>>> It depends on how sophisticated you want to get, or alternatively what
>>>> assumptions you are willing to make.
>>>>
>>>> The simplest thing to do would be to block on subject (see the blocking
>>>> portion of the limma User's guide, starting on p. 42). This makes very
>>>> simple assumptions about the data, namely that the differences between
>>>> subjects can be accounted for by the mean of each subject.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>> My sincere regards,
>>>>> Charles
>>>>>
>>>>>
>>>>> On Wed, Jun 26, 2013 at 11:54 AM, Charles Determan Jr <deter088 at umn.edu> wrote:
>>>>>
>>>>> To help clarify further here is a dataframe of the design.
>>>>>>
>>>>>>    subject   group times
>>>>>> 1        1 Treated   0hr
>>>>>> 2        2 Treated   0hr
>>>>>> 3        3 Control   0hr
>>>>>> 4        4 Treated   0hr
>>>>>> 5        5 Control   0hr
>>>>>> 6        6 Control   0hr
>>>>>> 7        1 Treated   1hr
>>>>>> 8        2 Treated   1hr
>>>>>> 9        3 Control   1hr
>>>>>> ...
>>>>>> 17       5 Control   2hr
>>>>>> 18       6 Control   2hr
>>>>>>
>>>>>> My thought process has been as follows:
>>>>>>
>>>>>> In the edgeR userguide there is the treatment combination example
>>>>>>
>>>>>> > targets
>>>>>>     Sample   Treat Time
>>>>>> 1  Sample1 Placebo   0h
>>>>>> 2  Sample2 Placebo   0h
>>>>>> 3  Sample3 Placebo   1h
>>>>>> 4  Sample4 Placebo   1h
>>>>>> 5  Sample5 Placebo   2h
>>>>>> 6  Sample6 Placebo   2h
>>>>>> 7  Sample1    Drug   0h
>>>>>> 8  Sample2    Drug   0h
>>>>>> 9  Sample3    Drug   1h
>>>>>> 10 Sample4    Drug   1h
>>>>>> 11 Sample5    Drug   2h
>>>>>> 12 Sample6    Drug   2h
>>>>>>
>>>>>> which combines the groups to produce a single group (ex. Drug.1,
>>>>>> Placebo.1, Drug.2, etc)
>>>>>>
>>>>>> This seems potentially appropriate but this appears to assume
>>>>>> independence between samples whereas my data consists of what you could
>>>>>> call 'true repeated measures' on the same sample. This seems to draw on
>>>>>> the paired samples and blocked examples. These proceed by having the
>>>>>> 'subject' as a factor as well, for example:
>>>>>>
>>>>>> design <- model.matrix(~Subject+Treatment)
>>>>>>
>>>>>> This leads me to guess that a combination of these techniques is
>>>>>> required. Perhaps merging the times and group factors in my dataset (see
>>>>>> above) as 'newgroup' (e.g. Control.0, Control.1, Treatment.0, etc). Then
>>>>>> create the model formula:
>>>>>>
>>>>>> design <- model.matrix(~Subject+newgroup)
>>>>>>
>>>>>> Does this seem appropriate or am I way off base and over thinking
>>>>>> this? Thanks for any suggestions.
>>>>>>
>>>>>> Regards,
>>>>>> Charles
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 25, 2013 at 11:11 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>>
>>>>>> Charles,
>>>>>>>
>>>>>>> Are there only 2 biological units in your experiment? (One for
>>>>>>> treatment
>>>>>>> and one for control?) Or do you have multiple biological units in
>>>>>>> each
>>>>>>> group? Surely it must be the latter but, if so, your model does not
>>>>>>> take
>>>>>>> this into account.
>>>>>>>
>>>>>>> What questions do you want to test?
>>>>>>>
>>>>>>> Best
>>>>>>> Gordon
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 25 Jun 2013, Charles Determan Jr wrote:
>>>>>>>
>>>>>>> Gordon,
>>>>>>>
>>>>>>>> I apologize for not being more definitive with my description. Your
>>>>>>>> initial definition is my intention, consecutive measurements on the same
>>>>>>>> biological units. I will look over the comments in the link you provided.
>>>>>>>> Thank you for your insight, I appreciate any further thoughts you may have.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Charles
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 25, 2013 at 6:57 PM, Gordon K Smyth<smyth at wehi.edu.au>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Dear Charles,
>>>>>>>>
>>>>>>>>> The term "repeated measures" describes a situation in which repeated
>>>>>>>>> measurements are made on the same biological unit. Hence the
>>>>>>>>> repeated
>>>>>>>>> measurements are correlated. It is not clear from the brief
>>>>>>>>> information
>>>>>>>>> you give whether this is the case, or whether the different time
>>>>>>>>> points
>>>>>>>>> derive from independent biological samples.
>>>>>>>>>
>>>>>>>>> The model you give might or might not be correct, depending on the
>>>>>>>>> experimental units and the hypotheses that you plan to test. For
>>>>>>>>> most
>>>>>>>>> experiments it is not the right approach, for reasons that I have
>>>>>>>>> pointed
>>>>>>>>> out elsewhere:
>>>>>>>>>
>>>>>>>>> https://www.stat.math.ethz.ch/pipermail/bioconductor/2013-June/053297.html
>>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Gordon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Date: Mon, 24 Jun 2013 15:08:48 -0500
>>>>>>>>>
>>>>>>>>> From: Charles Determan Jr<deter088 at umn.edu>
>>>>>>>>>> To: bioconductor at r-project.org
>>>>>>>>>> Subject: [BioC] Repeated Measures mRNA expression analysis
>>>>>>>>>>
>>>>>>>>>> Greetings,
>>>>>>>>>>
>>>>>>>>>> I need to analyze data collected from an RNA-seq experiment. This
>>>>>>>>>> consists of comparing two groups (control vs. treatment) and repeated
>>>>>>>>>> sampling (1 hour, 2 hours, 3 hours). If this were a univariate problem I
>>>>>>>>>> know I would use a 2-way rmANOVA analysis but this is RNA-seq and I have
>>>>>>>>>> thousands of variables. I am very familiar with multiple packages for RNA
>>>>>>>>>> differential expression analysis (e.g. DESeq2, edgeR, limma, etc.) but I
>>>>>>>>>> have been unable to figure out the most appropriate way to analyze
>>>>>>>>>> such data in this circumstance. The closest answer I can find within the
>>>>>>>>>> DESeq2 and edgeR manuals (limma is somewhat confusing to me) is to place the
>>>>>>>>>> main treatment of interest at the end of the design formula, for example:
>>>>>>>>>>
>>>>>>>>>> design(dds)<- formula(~ time + treatment)
>>>>>>>>>>
>>>>>>>>>> Is this what is considered the appropriate way to address repeated
>>>>>>>>>> measures in mRNA expression experiments? Any thoughts are appreciated.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Charles Determan
>>>>>>>>>> Integrated Biosciences PhD Candidate
>>>>>>>>>> University of Minnesota
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>> Charles Determan
>>>>>>>> Integrated Biosciences PhD Candidate
>>>>>>>> University of Minnesota
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099