[BioC] Repeated Measures mRNA expression analysis

Wed Jul 3 09:17:13 CEST 2013

On Tue, 2 Jul 2013, Charles Determan Jr wrote:

> Thank you Gordon,
> I apologize for not being more straightforward with a specific question. 
> I continue to be reminded of my ignorance in approaching statistical 
> problems but my education continues.  I had previously been under the 
> impression that different experimental designs have specific methods 
> that are most appropriate.

This is a common misconception.  The appropriate analysis is not 
determined solely by the experimental layout.

Gordon

> I previously read section 3.5 in the edgeR guide but glossed over it 
> because it didn't have time points explicitly included.  I feel a little 
> silly that the idea of between and within subjects escaped me but that 
> should serve my purposes.  Please indulge one further question 
> concerning that very example.
>
> The design matrix I generate looks like this:
>
> > colnames(design)
> [1] "(Intercept)"                "group.Treatment"
> [3] "group.control:subject"      "group.Treatment:subject"
> [5] "group.control:times.time2"  "group.Treatment:times.time2"
> [7] "group.control:times.time3"  "group.Treatment:times.time3"
>
> You said "The analysis given in the edgeR user's guide allows you to 
> find genes that are different over time for (i) treated subjects and 
> (ii) control subjects, and it allows you to find genes that respond 
> differently to time in the treated vs control subject."  If I am not 
> mistaken, coefficients 5-8 correspond to point (i) and (ii).  However, I 
> don't see how I can determine which genes respond differently to time in 
> the treated vs. control subject.  I apologize if I seem obtuse but these 
> interactions have always been difficult for me to conceptualize.  Any 
> explanation or direction so that I may understand these interactions 
> related to your points would be sincerely appreciated.
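
A minimal sketch in R of how that last comparison could be set up, following the nested layout of Section 3.5 of the edgeR User's Guide. The sample sheet `targets`, the factor name `nested` and the helper `make_contrast` are hypothetical, so the column names differ from those listed above. Subject is treated here as a factor re-numbered within each group, which is an assumption about the intended model rather than something the thread confirms.

```r
# Hypothetical sample sheet mirroring the layout described in the thread:
# 6 subjects (3 per group), each measured at 0hr, 1hr and 2hr.
targets <- data.frame(
  subject = rep(1:6, times = 3),
  group   = rep(c("Treated", "Treated", "Control",
                  "Treated", "Control", "Control"), times = 3),
  times   = rep(c("0hr", "1hr", "2hr"), each = 6)
)

group <- factor(targets$group, levels = c("Control", "Treated"))
times <- factor(targets$times, levels = c("0hr", "1hr", "2hr"))

# Re-number subjects 1..3 within each group so that subject is nested in
# group, and store the result as a factor (assumption: a numeric subject
# would enter the model as a single slope per group instead).
nested <- factor(ave(targets$subject, targets$group,
                     FUN = function(s) as.integer(factor(s))))

# Subject baselines within group, plus a separate time effect for each group.
design <- model.matrix(~ group + group:nested + group:times)

# Build a contrast by column name so it does not depend on column order.
make_contrast <- function(plus, minus) {
  stopifnot(all(c(plus, minus) %in% colnames(design)))
  v <- setNames(numeric(ncol(design)), colnames(design))
  v[plus]  <- 1
  v[minus] <- -1
  v
}

# Treated-minus-Control difference in the time effect at 1hr and at 2hr.
diff_1hr <- make_contrast("groupTreated:times1hr", "groupControl:times1hr")
diff_2hr <- make_contrast("groupTreated:times2hr", "groupControl:times2hr")
time_by_group <- cbind(diff_1hr, diff_2hr)
```

In this sketch the `groupControl:times...` and `groupTreated:times...` coefficients describe change over time within each group, while the two columns of `time_by_group` are the differences that address whether genes respond differently to time in treated versus control subjects. With a model fitted by edgeR's `glmFit`, passing the matrix to `glmLRT(fit, contrast = time_by_group)` would test both differences jointly.
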
>
> Best regards,
> Charles
>
>
> On Tue, Jul 2, 2013 at 3:31 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Hi Charles,
>>
>> Yes, you're on the right track now, but this is not a simple design and it
>> requires care.  As James says, it depends on what assumptions you want to
>> make.  I would add that it also depends on what questions you want to
>> answer.  In my previous two posts, I tried to prompt you to state what
>> questions you want to answer, but you haven't taken the bait yet.  A
>> statistical analysis is always designed to test certain scientific
>> questions -- there isn't a "correct" analysis for a given design
>> independent of what your hypotheses are.
>>
>> Have you looked at Section 3.5 "Comparisons Both Between and Within
>> Subjects" in the edgeR User's Guide?  The design discussed in this section
>> is the same as your experiment, except that you have 3 repeated measures
>> per subject instead of 2.
>>
>> The analysis given in the edgeR user's guide allows you to find genes that
>> are different over time for (i) treated subjects and (ii) control subjects,
>> and it allows you to find genes that respond differently to time in the
>> treated vs control subject.
>>
>> However it does not allow you to test for a baseline difference between
>> treated and control subjects at time 0.  If you need to do this, then a
>> quite different analysis is needed (discussed in Section 9.7 "Multi-level
>> Experiments" of the limma User's Guide).
>>
>> Best wishes
>> Gordon
>>
>>
>> On Mon, 1 Jul 2013, James W. MacDonald wrote:
>>
>>> Hi Charles,
>>>
>>> On 7/1/2013 9:07 AM, Charles Determan Jr wrote:
>>>
>>>> I apologize for a second post but I want to bring this question back up
>>>> as I still cannot find a definitive answer on my own.  In brief, I am
>>>> wondering about the design matrix when testing for differential expression
>>>> between two groups within which each sample has been measured at
>>>> consecutive timepoints (repeated measures).  Therefore, if my
>>>> interpretations are correct, I need a two-way analysis that recognizes
>>>> dependence between consecutive measurements.  I am familiar with limma,
>>>> edgeR and DESeq but am uncertain how to design an appropriate design matrix
>>>> for these comparisons.  The best I can guess is that I add a 'Subject'
>>>> factor to the design matrix corresponding to each unique sample to correct
>>>> for dependence, is this correct?
>>>>
>>>
>>> It depends on how sophisticated you want to get, or alternatively what
>>> assumptions you are willing to make.
>>>
>>> The simplest thing to do would be to block on subject (see the blocking
>>> portion of the limma User's guide, starting on p. 42). This makes very
>>> simple assumptions about the data, namely that the differences between
>>> subjects can be accounted for by the mean of each subject.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>> My sincere regards,
>>>> Charles
>>>>
>>>>
>>>> On Wed, Jun 26, 2013 at 11:54 AM, Charles Determan Jr <deter088 at umn.edu> wrote:
>>>>
>>>>> To help clarify further here is a dataframe of the design.
>>>>>
>>>>>     subject  group times
>>>>> 1        1 Treated    0hr
>>>>> 2        2 Treated    0hr
>>>>> 3        3 Control    0hr
>>>>> 4        4 Treated    0hr
>>>>> 5        5 Control    0hr
>>>>> 6        6 Control    0hr
>>>>> 7        1 Treated    1hr
>>>>> 8        2 Treated    1hr
>>>>> 9        3 Control    1hr
>>>>> ...
>>>>> 17       5 Control    2hr
>>>>> 18       6 Control    2hr
>>>>>
>>>>> My thought process has been as follows:
>>>>>
>>>>> In the edgeR userguide there is the treatment combination example
>>>>>
>>>>> > targets
>>>>>     Sample   Treat Time
>>>>> 1  Sample1 Placebo   0h
>>>>> 2  Sample2 Placebo   0h
>>>>> 3  Sample3 Placebo   1h
>>>>> 4  Sample4 Placebo   1h
>>>>> 5  Sample5 Placebo   2h
>>>>> 6  Sample6 Placebo   2h
>>>>> 7  Sample1    Drug   0h
>>>>> 8  Sample2    Drug   0h
>>>>> 9  Sample3    Drug   1h
>>>>> 10 Sample4    Drug   1h
>>>>> 11 Sample5    Drug   2h
>>>>> 12 Sample6    Drug   2h
>>>>>
>>>>> which combines the groups to produce a single group (e.g. Drug.1,
>>>>> Placebo.1, Drug.2, etc)
>>>>>
>>>>> This seems potentially appropriate but this appears to assume
>>>>> independence between samples whereas my data consists of what you could
>>>>> call 'true repeated measures' on the same sample.  This seems to draw on
>>>>> the paired samples and blocked examples.  These proceed by having the
>>>>> 'subject' as a factor as well, for example:
>>>>>
>>>>> design <- model.matrix(~Subject+Treatment)
>>>>>
>>>>> This leads me to guess that a combination of these techniques is
>>>>> required. Perhaps merging the times and group factors in my dataset (see
>>>>> above) as 'newgroup' (e.g. Control.0, Control.1, Treatment.0, etc).  Then
>>>>> create the model formula:
>>>>>
>>>>> design <- model.matrix(~Subject+newgroup)
>>>>>
>>>>> Does this seem appropriate or am I way off base and over thinking this?
>>>>> Thanks for any suggestions.
>>>>>
>>>>> Regards,
>>>>> Charles
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 25, 2013 at 11:11 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>
>>>>>> Charles,
>>>>>>
>>>>>> Are there only 2 biological units in your experiment?  (One for treatment
>>>>>> and one for control?)  Or do you have multiple biological units in each
>>>>>> group?  Surely it must be the latter but, if so, your model does not take
>>>>>> this into account.
>>>>>>
>>>>>> What questions do you want to test?
>>>>>>
>>>>>> Best
>>>>>> Gordon
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 25 Jun 2013, Charles Determan Jr wrote:
>>>>>>
>>>>>>> Gordon,
>>>>>>>
>>>>>>> I apologize for not being more definitive with my description. Your
>>>>>>> initial definition is my intention, consecutive measurements on the same
>>>>>>> biological units.  I will look over the comments in the link you provided.
>>>>>>> Thank you for your insight, I appreciate any further thoughts you may have.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Charles
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 25, 2013 at 6:57 PM, Gordon K Smyth<smyth at wehi.edu.au>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Dear Charles,
>>>>>>>>
>>>>>>>> The term "repeated measures" describes a situation in which repeated
>>>>>>>> measurements are made on the same biological unit.  Hence the repeated
>>>>>>>> measurements are correlated.  It is not clear from the brief information
>>>>>>>> you give whether this is the case, or whether the different time points
>>>>>>>> derive from independent biological samples.
>>>>>>>>
>>>>>>>> The model you give might or might not be correct, depending on the
>>>>>>>> experimental units and the hypotheses that you plan to test.  For most
>>>>>>>> experiments it is not the right approach, for reasons that I have pointed
>>>>>>>> out elsewhere:
>>>>>>>>
>>>>>>>> https://www.stat.math.ethz.ch/pipermail/bioconductor/2013-June/053297.html
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>> Gordon
>>>>>>>>
>>>>>>>>
>>>>>>>>> Date: Mon, 24 Jun 2013 15:08:48 -0500
>>>>>>>>> From: Charles Determan Jr <deter088 at umn.edu>
>>>>>>>>> To: bioconductor at r-project.org
>>>>>>>>> Subject: [BioC] Repeated Measures mRNA expression analysis
>>>>>>>>>
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I need to analyze data collected from an RNA-seq experiment. This
>>>>>>>>> consists of comparing two groups (control vs. treatment) and repeated
>>>>>>>>> sampling (1 hour, 2 hours, 3 hours).  If this were a univariate problem I
>>>>>>>>> know I would use a 2-way rmANOVA analysis but this is RNA-seq and I have
>>>>>>>>> thousands of variables.  I am very familiar with multiple packages for RNA
>>>>>>>>> differential expression analysis (e.g. DESeq2, edgeR, limma, etc.) but I
>>>>>>>>> have been unable to figure out the most appropriate way to analyze
>>>>>>>>> such data in this circumstance. The closest answer I can find within the
>>>>>>>>> DESeq2 and edgeR manuals (limma is somewhat confusing to me) is to place the
>>>>>>>>> main treatment of interest at the end of the design formula, for example:
>>>>>>>>>
>>>>>>>>> design(dds)<- formula(~ time + treatment)
>>>>>>>>>
>>>>>>>>> Is this what is considered the appropriate way to address repeated
>>>>>>>>> measures in mRNA expression experiments?  Any thoughts are appreciated.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Charles Determan
>>>>>>>>> Integrated Biosciences PhD Candidate
>>>>>>>>> University of Minnesota
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>
>>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>>
>>>
>>>
>> ______________________________________________________________________
>>
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
>>
>
>
>
>
