[BioC] Repeated Measures mRNA expression analysis
Gordon K Smyth
smyth at wehi.EDU.AU
Wed Jul 3 09:17:13 CEST 2013
On Tue, 2 Jul 2013, Charles Determan Jr wrote:
> Thank you Gordon,
> I apologize for not being more straightforward with a specific question.
> I continue to be reminded of my ignorance in approaching statistical
> problems but my education continues. I had previously been under the
> impression that different experimental designs have specific methods
> that are most appropriate.
This is a common misconception. The appropriate analysis is not
determined solely by the experimental layout.
Gordon
> I previously read section 3.5 in the edgeR guide but glossed over it
> because it didn't have time points explicitly included. I feel a little
> silly that the idea of between and within subjects escaped me but that
> should serve my purposes. If you will indulge one further question
> concerning that very example.
>
> The design matrix I generate looks like this:
>
> > colnames(design)
> [1] "(Intercept)"                 "group.Treatment"
> [3] "group.control:subject"       "group.Treatment:subject"
> [5] "group.control:times.time2"   "group.Treatment:times.time2"
> [7] "group.control:times.time3"   "group.Treatment:times.time3"
>
> You said "The analysis given in the edgeR user's guide allows you to
> find genes that are different over time for (i) treated subjects and
> (ii) control subjects, and it allows you to find genes that respond
> differently to time in the treated vs control subject." If I am not
> mistaken, coefficients 5-8 correspond to point (i) and (ii). However, I
> don't see how I can determine which genes respond differently to time in
> the treated vs. control subject. I apologize if I seem obtuse but these
> interactions have always been difficult for me to conceptualize. Any
> explanation or direction so that I may understand these interactions
> related to your points would be sincerely appreciated.
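A minimal sketch of the third comparison, in base R and under stated assumptions. With the column layout shown in the question, coefficients 5 and 7 are the time effects in control subjects and coefficients 6 and 8 are the time effects in treated subjects, so a gene responds differently to time in treated vs control when coefficient 6 differs from 5, or 8 differs from 7. The data frame below is a hypothetical stand-in for the 18-sample layout described in this thread; `targets`, `nested`, `make_contrast` and the level labels are illustrative names, not taken from the real data. Subject is coded as a factor re-numbered within group, which is how Section 3.5 of the edgeR User's Guide nests it. The single subject column per group in the question may indicate that subject was entered as a number rather than a factor, though that is only a guess from the column names.

```r
# Hypothetical layout: 6 subjects, 3 per group, each measured at 3 time points.
targets <- data.frame(
  subject = factor(rep(1:6, times = 3)),
  group   = factor(rep(c("Treated", "Treated", "Control",
                         "Treated", "Control", "Control"), times = 3),
                   levels = c("Control", "Treated")),
  time    = factor(rep(c("0hr", "1hr", "2hr"), each = 6))
)

# Re-number subjects within each group (1, 2, 3) so that subject can be
# nested inside group.
targets$nested <- factor(ave(as.integer(targets$subject), targets$group,
                             FUN = function(s) match(s, unique(s))))

# Group effect, subject-within-group effects, and a separate time effect
# for each group.
design <- model.matrix(~ group + group:nested + group:time, data = targets)

# Build a contrast vector by column name rather than by position.
make_contrast <- function(plus, minus) {
  stopifnot(plus %in% colnames(design), minus %in% colnames(design))
  v <- setNames(numeric(ncol(design)), colnames(design))
  v[plus]  <- 1
  v[minus] <- -1
  v
}

# Treated-minus-control difference in the 1hr and 2hr time effects.
diff_1hr <- make_contrast("groupTreated:time1hr", "groupControl:time1hr")
diff_2hr <- make_contrast("groupTreated:time2hr", "groupControl:time2hr")
interaction_contrasts <- cbind(diff_1hr, diff_2hr)

# With an edgeR fit (assumed here, not created), the two contrasts could be
# tested jointly:
#   fit <- glmFit(y, design)
#   lrt <- glmLRT(fit, contrast = interaction_contrasts)
```

Testing the two columns together asks whether the time response differs between groups at any time point; passing `diff_1hr` or `diff_2hr` alone asks the same question for a single time point.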
>
> Best regards,
> Charles
>
>
> On Tue, Jul 2, 2013 at 3:31 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Hi Charles,
>>
>> Yes, you're on the right track now, but this is not a simple design and it
>> requires care. As James says, it depends on what assumptions you want to
>> make. I would add that it also depends on what questions you want to
>> answer. In my previous two posts, I tried to prompt you to state what
>> questions you want to answer, but you haven't taken the bait yet. A
>> statistical analysis is always designed to test certain scientific
>> questions -- there isn't a "correct" analysis for a given design
>> independent of what your hypotheses are.
>>
>> Have you looked at Section 3.5 "Comparisons Both Between and Within
>> Subjects" in the edgeR User's Guide? The design discussed in this section
>> is the same as your experiment, except that you have 3 repeated measures
>> per subject instead of 2.
>>
>> The analysis given in the edgeR user's guide allows you to find genes that
>> are different over time for (i) treated subjects and (ii) control subjects,
>> and it allows you to find genes that respond differently to time in the
>> treated vs control subject.
>>
>> However it does not allow you to test for a baseline difference between
>> treated and control subjects at time 0. If you need to do this, then a
>> quite different analysis is needed (discussed in Section 9.7 "Multi-level
>> Experiments" of the limma User's Guide).
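The time-0 comparison can be outlined in base R as follows. This is only a sketch, assuming the approach of the limma section cited above: one coefficient per group-by-time combination, with subject treated as a random block instead of a fixed effect. The `targets` data frame and the names `cell` and `baseline` are hypothetical, and the limma calls appear only as comments because they need real expression data.

```r
# Hypothetical layout: 6 subjects, 3 per group, each measured at 3 time points.
targets <- data.frame(
  subject = factor(rep(1:6, times = 3)),
  group   = rep(c("Treated", "Treated", "Control",
                  "Treated", "Control", "Control"), times = 3),
  time    = rep(c("0hr", "1hr", "2hr"), each = 6)
)

# One factor level, and hence one coefficient, per group-by-time cell.
cell <- factor(paste(targets$group, targets$time, sep = "."))
design <- model.matrix(~ 0 + cell)
colnames(design) <- levels(cell)

# Baseline difference: treated vs control at time 0.
baseline <- setNames(numeric(ncol(design)), colnames(design))
baseline["Treated.0hr"] <- 1
baseline["Control.0hr"] <- -1

# With limma loaded and `v` a voom-transformed expression object (both
# assumed, not created here), subject enters as a blocking variable:
#   corfit <- duplicateCorrelation(v, design, block = targets$subject)
#   fit    <- lmFit(v, design, block = targets$subject,
#                   correlation = corfit$consensus)
#   fit    <- eBayes(contrasts.fit(fit, baseline))
```

Because subject is not a fixed effect in this design, the between-subject comparison at time 0 remains estimable; the price is the extra assumption of a common within-subject correlation.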
>>
>> Best wishes
>> Gordon
>>
>>
>> On Mon, 1 Jul 2013, James W. MacDonald wrote:
>>
>>> Hi Charles,
>>>
>>> On 7/1/2013 9:07 AM, Charles Determan Jr wrote:
>>>
>>>> I apologize for a second post but I want to bring this question back up
>>>> as I still cannot find a definitive answer on my own. In brief, I am
>>>> wondering about the design matrix when testing for differential expression
>>>> between two groups within which each sample has been measured at
>>>> consecutive timepoints (repeated measures). Therefore, if my
>>>> interpretations are correct, I need a two-way analysis that recognizes
>>>> dependence between consecutive measurements. I am familiar with limma,
>>>> edgeR and DESeq but am uncertain how to design an appropriate design matrix
>>>> for these comparisons. The best I can guess is that I add a 'Subject'
>>>> factor to the design matrix corresponding to each unique sample to correct
>>>> for dependence, is this correct?
>>>>
>>>
>>> It depends on how sophisticated you want to get, or alternatively what
>>> assumptions you are willing to make.
>>>
>>> The simplest thing to do would be to block on subject (see the blocking
>>> portion of the limma User's guide, starting on p. 42). This makes very
>>> simple assumptions about the data, namely that the differences between
>>> subjects can be accounted for by the mean of each subject.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>> My sincere regards,
>>>> Charles
>>>>
>>>>
>>>> On Wed, Jun 26, 2013 at 11:54 AM, Charles Determan Jr <deter088 at umn.edu> wrote:
>>>>
>>>>> To help clarify further here is a dataframe of the design.
>>>>>
>>>>>    subject   group times
>>>>> 1        1 Treated   0hr
>>>>> 2        2 Treated   0hr
>>>>> 3        3 Control   0hr
>>>>> 4        4 Treated   0hr
>>>>> 5        5 Control   0hr
>>>>> 6        6 Control   0hr
>>>>> 7        1 Treated   1hr
>>>>> 8        2 Treated   1hr
>>>>> 9        3 Control   1hr
>>>>> ...
>>>>> 17       5 Control   2hr
>>>>> 18       6 Control   2hr
>>>>>
>>>>> My thought process has been as follows:
>>>>>
>>>>> In the edgeR userguide there is the treatment combination example
>>>>>
>>>>> > targets
>>>>>     Sample   Treat Time
>>>>> 1  Sample1 Placebo   0h
>>>>> 2  Sample2 Placebo   0h
>>>>> 3  Sample3 Placebo   1h
>>>>> 4  Sample4 Placebo   1h
>>>>> 5  Sample5 Placebo   2h
>>>>> 6  Sample6 Placebo   2h
>>>>> 7  Sample1    Drug   0h
>>>>> 8  Sample2    Drug   0h
>>>>> 9  Sample3    Drug   1h
>>>>> 10 Sample4    Drug   1h
>>>>> 11 Sample5    Drug   2h
>>>>> 12 Sample6    Drug   2h
>>>>>
>>>>> which combines the groups to produce a single group (ex. Drug.1,
>>>>> Placebo.1, Drug.2, etc)
>>>>>
>>>>> This seems potentially appropriate but this appears to assume
>>>>> independence between samples whereas my data consists of what you could
>>>>> call 'true repeated measures' on the same sample. This seems to draw on
>>>>> the paired samples and blocked examples. These proceed by having the
>>>>> 'subject' as a factor as well, for example:
>>>>>
>>>>> design <- model.matrix(~Subject+Treatment)
>>>>>
>>>>> This leads me to guess that a combination of these techniques is
>>>>> required. Perhaps merging the times and group factors in my dataset (see
>>>>> above) as 'newgroup' (e.g. Control.0, Control.1, Treatment.0, etc). Then
>>>>> create the model formula:
>>>>>
>>>>> design <- model.matrix(~Subject+newgroup)
>>>>>
>>>>> Does this seem appropriate or am I way off base and over thinking this?
>>>>> Thanks for any suggestions.
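A quick rank check, sketched in base R, shows why the additive formula needs care when every subject belongs to only one group. The data frame mirrors the layout posted earlier in this message (subjects 1, 2 and 4 treated; subjects 3, 5 and 6 control); the object names are illustrative.

```r
# Layout as posted: 6 subjects, each measured at 0hr, 1hr and 2hr.
targets <- data.frame(
  subject = factor(rep(1:6, times = 3)),
  group   = rep(c("Treated", "Treated", "Control",
                  "Treated", "Control", "Control"), times = 3),
  times   = rep(c("0hr", "1hr", "2hr"), each = 6)
)

# Merge group and time into a single factor, as proposed above.
targets$newgroup <- factor(paste(targets$group, targets$times, sep = "."))

# Additive design: subject effects plus one effect per group-by-time cell.
design <- model.matrix(~ subject + newgroup, data = targets)

# Compare the number of coefficients with the rank of the design matrix.
n_columns   <- ncol(design)
design_rank <- qr(design)$rank
cat("columns:", n_columns, " rank:", design_rank, "\n")
```

Because group membership is already determined by subject, the rank falls short of the number of columns, so not every coefficient in this design can be estimated. Nesting subject within group, as in the edgeR section discussed elsewhere in this thread, is one way around this.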
>>>>>
>>>>> Regards,
>>>>> Charles
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 25, 2013 at 11:11 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>
>>>>>> Charles,
>>>>>>
>>>>>> Are there only 2 biological units in your experiment? (One for treatment
>>>>>> and one for control?) Or do you have multiple biological units in each
>>>>>> group? Surely it must be the latter but, if so, your model does not take
>>>>>> this into account.
>>>>>>
>>>>>> What questions do you want to test?
>>>>>>
>>>>>> Best
>>>>>> Gordon
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 25 Jun 2013, Charles Determan Jr wrote:
>>>>>>
>>>>>>> Gordon,
>>>>>>>
>>>>>>> I apologize for not being more definitive with my description. Your
>>>>>>> initial definition is my intention, consecutive measurements on the same
>>>>>>> biological units. I will look over the comments in the link you provided.
>>>>>>> Thank you for your insight, I appreciate any further thoughts you may have.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Charles
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 25, 2013 at 6:57 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>>>
>>>>>>>> Dear Charles,
>>>>>>>>
>>>>>>>> The term "repeated measures" describes a situation in which repeated
>>>>>>>> measurements are made on the same biological unit. Hence the repeated
>>>>>>>> measurements are correlated. It is not clear from the brief information
>>>>>>>> you give whether this is the case, or whether the different time points
>>>>>>>> derive from independent biological samples.
>>>>>>>>
>>>>>>>> The model you give might or might not be correct, depending on the
>>>>>>>> experimental units and the hypotheses that you plan to test. For most
>>>>>>>> experiments it is not the right approach, for reasons that I have pointed
>>>>>>>> out elsewhere:
>>>>>>>>
>>>>>>>> https://www.stat.math.ethz.ch/pipermail/bioconductor/2013-June/053297.html
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Gordon
>>>>>>>>
>>>>>>>>
>>>>>>>>> Date: Mon, 24 Jun 2013 15:08:48 -0500
>>>>>>>>> From: Charles Determan Jr <deter088 at umn.edu>
>>>>>>>>> To: bioconductor at r-project.org
>>>>>>>>> Subject: [BioC] Repeated Measures mRNA expression analysis
>>>>>>>>>
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I need to analyze data collected from an RNA-seq experiment. This
>>>>>>>>> consists of comparing two groups (control vs. treatment) and repeated
>>>>>>>>> sampling (1 hour, 2 hours, 3 hours). If this were a univariate problem I
>>>>>>>>> know I would use a 2-way rmANOVA analysis but this is RNA-seq and I have
>>>>>>>>> thousands of variables. I am very familiar with multiple packages for RNA
>>>>>>>>> differential expression analysis (e.g. DESeq2, edgeR, limma, etc.) but I
>>>>>>>>> have been unable to figure out the most appropriate way to analyze
>>>>>>>>> such data in this circumstance. The closest answer I can find within the
>>>>>>>>> DESeq2 and edgeR manuals (limma is somewhat confusing to me) is to place the
>>>>>>>>> main treatment of interest at the end of the design formula, for example:
>>>>>>>>>
>>>>>>>>> design(dds)<- formula(~ time + treatment)
>>>>>>>>>
>>>>>>>>> Is this what is considered the appropriate way to address repeated
>>>>>>>>> measures in mRNA expression experiments? Any thoughts are appreciated.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Charles Determan
>>>>>>>>> Integrated Biosciences PhD Candidate
>>>>>>>>> University of Minnesota
>>>>>>>>>
>>>>>>>>>
>>>>>>> --
>>>>>>> Charles Determan
>>>>>>> Integrated Biosciences PhD Candidate
>>>>>>> University of Minnesota
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>>
>>>
>>>
>> ______________________________________________________________________
>>
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
>>
>
>
>
> --
> Charles Determan
> Integrated Biosciences PhD Candidate
> University of Minnesota
>