[BioC] Repeated Measures mRNA expression analysis

Wed Jul 3 09:12:28 CEST 2013

On Tue, 2 Jul 2013, Charles Determan Jr wrote:

> I made an error in my last response, my subject was not set as a factor. 
> The design matrix looks like the matrix below.  Perhaps a shorter 
> question that will satisfy me is what I do with all the 'subject' 
> coefficients.  If I am only interested in the genes that respond 
> differently to time in treated vs. control subjects do I simply ignore 
> the subject coefficients?

Yes, you ignore them.  They take care of the subject baseline effects, but 
these are not of interest to you.

> Lastly, am I correct that to determine genes that respond differently to 
> time in treated vs. control subjects I simply conduct the contrasts 
> between the last coefficients (i.e. 40-39 and 38-37).

Yes.
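
Column positions are easy to get wrong, so it may be safer to build these contrasts from the column names. A minimal sketch in base R, assuming a small hypothetical layout (3 subjects per group, renumbered within group as in the edgeR User's Guide example, 3 time points each). The data frame, the helper make_contrast() and the resulting column names (e.g. "groupTreated:times1hr") are illustrative and will differ from the names in your design:

```r
# Hypothetical layout: 2 groups x 3 subjects per group x 3 time points = 18 samples
targets <- data.frame(
  group   = factor(rep(c("Control", "Treated"), each = 9),
                   levels = c("Control", "Treated")),
  subject = factor(rep(rep(1:3, each = 3), times = 2)),  # numbered within group
  times   = factor(rep(c("0hr", "1hr", "2hr"), times = 6))
)

# Subject and time effects are both nested within group
design <- model.matrix(~ group + group:subject + group:times, data = targets)

# Illustrative helper: +1 on one named coefficient, -1 on another, 0 elsewhere
make_contrast <- function(design, plus, minus) {
  stopifnot(plus %in% colnames(design), minus %in% colnames(design))
  v <- setNames(numeric(ncol(design)), colnames(design))
  v[plus]  <-  1
  v[minus] <- -1
  v
}

# Time effect in treated minus time effect in control, one contrast per time point
con_1hr <- make_contrast(design, "groupTreated:times1hr", "groupControl:times1hr")
con_2hr <- make_contrast(design, "groupTreated:times2hr", "groupControl:times2hr")
```

Each vector could then be passed as the contrast argument of glmLRT(), for example glmLRT(fit, contrast = con_1hr).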

> Apologies for turning this into such a long post, I hope it is helpful 
> for others as well.

To find genes DE at time2 in the controls:
   coef = "group.Control:times.time2"

To find genes DE at time2 in the treated subjects:
   coef = "group.Treatment:times.time2"

To find genes DE at time3 in the controls:
   coef = "group.Control:times.time3"

To find genes DE at time3 in the treated subjects:
   coef = "group.Treatment:times.time3"
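
As a rough sketch of how such a coefficient is tested, the code below builds a nested design for a small hypothetical layout and, only if edgeR is installed, runs a likelihood ratio test on one named coefficient. The counts are simulated placeholders, the layout and column names (e.g. "groupControl:times1hr") are assumptions made for illustration, and estimateDisp() is from current edgeR releases rather than the 2013 version:

```r
# Hypothetical layout: 2 groups x 3 subjects per group x 3 time points = 18 samples
targets <- data.frame(
  group   = factor(rep(c("Control", "Treated"), each = 9),
                   levels = c("Control", "Treated")),
  subject = factor(rep(rep(1:3, each = 3), times = 2)),  # numbered within group
  times   = factor(rep(c("0hr", "1hr", "2hr"), times = 6))
)
design <- model.matrix(~ group + group:subject + group:times, data = targets)

# The within-group time coefficients; the subject columns are simply not tested
time_coefs <- grep(":times", colnames(design), value = TRUE)
print(time_coefs)

if (requireNamespace("edgeR", quietly = TRUE)) {
  # Simulated counts, for illustration only (no real signal)
  set.seed(1)
  counts <- matrix(rnbinom(200 * nrow(targets), mu = 50, size = 10),
                   nrow = 200,
                   dimnames = list(paste0("gene", 1:200), NULL))

  y   <- edgeR::DGEList(counts = counts)
  y   <- edgeR::calcNormFactors(y)
  y   <- edgeR::estimateDisp(y, design)
  fit <- edgeR::glmFit(y, design)

  # Genes changing at 1hr relative to 0hr within the control subjects
  lrt <- edgeR::glmLRT(fit, coef = "groupControl:times1hr")
  print(edgeR::topTags(lrt, n = 5))
}
```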

Best wishes
Gordon

> Thanks, Charles
>
>> colnames(design)
> [1] "(Intercept)" "group.Treatment"
> [3] "group.Control:subject.2" "group.Treatment:subject.2"
> [5] "group.Control:subject.3" "group.Treatment:subject.3"
> [7] "group.Control:subject.4" "group.Treatment:subject.4"
> [9] "group.Control:subject.5" "group.Treatment:subject.5"
> [11] "group.Control:subject.6" "group.Treatment:subject.6"
> [13] "group.Control:subject.7" "group.Treatment:subject.7"
> [15] "group.Control:subject.8" "group.Treatment:subject.8"
> [17] "group.Control:subject.9" "group.Treatment:subject.9"
> [19] "group.Control:subject.10" "group.Treatment:subject.10"
> [21] "group.Control:subject.11" "group.Treatment:subject.11"
> [23] "group.Control:subject.12" "group.Treatment:subject.12"
> [25] "group.Control:subject.13" "group.Treatment:subject.13"
> [27] "group.Control:subject.14" "group.Treatment:subject.14"
> [29] "group.Control:subject.15" "group.Treatment:subject.15"
> [31] "group.Control:subject.16" "group.Treatment:subject.16"
> [33] "group.Control:subject.17" "group.Treatment:subject.17"
> [35] "group.Control:subject.18" "group.Treatment:subject.18"
> [37] "group.Control:times.time2" "group.Treatment:times.time2"
> [39] "group.Control:times.time3" "group.Treatment:times.time3"
>
>
>
>
>
> On Tue, Jul 2, 2013 at 10:50 AM, Charles Determan Jr <deter088 at umn.edu> wrote:
>
>> Thank you Gordon,
>> I apologize for not being more straightforward with a specific question.
>> I continue to be reminded of my ignorance in approaching statistical
>> problems but my education continues.  I had previously been under the
>> impression that different experimental designs have specific methods that
>> are most appropriate.  I previously read section 3.5 in the edgeR guide but
>> glossed over it because it didn't have time points explicitly included.  I
>> feel a little silly that the idea of between and within subjects escaped me
>> but that should serve my purposes.  If you will indulge one further
>> question concerning that very example.
>>
>> The design matrix I generate looks like this:
>>
>> > colnames(design)
>>  [1] "(Intercept)"                "group.Treatment"
>>  [3] "group.control:subject"      "group.Treatment:subject"
>>  [5] "group.control:times.time2"  "group.Treatment:times.time2"
>>  [7] "group.control:times.time3"  "group.Treatment:times.time3"
>>
>>
>> You said "The analysis given in the edgeR user's guide allows you to find
>> genes that are different over time for (i) treated subjects and (ii)
>> control subjects, and it allows you to find genes that respond differently
>> to time in the treated vs control subject."  If I am not mistaken,
>> coefficients 5-8 correspond to point (i) and (ii).  However, I don't see
>> how I can determine which genes respond different to time in the treated
>> vs. control subject.  I apologize if I seem obtuse but these interactions
>> have always been difficult for me to conceptualize.  Any explanation or
>> direction so that I may understand these interactions related to your
>> points would be sincerely appreciated.
>>
>> Best regards,
>> Charles
>>
>>
>> On Tue, Jul 2, 2013 at 3:31 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>
>>> Hi Charles,
>>>
>>> Yes, you're on the right track now, but this is not a simple design and
>>> it requires care.  As James says, it depends on what assumptions you want
>>> to make.  I would add that it also depends on what questions you want to
>>> answer.  In my previous two posts, I tried to prompt you to state what
>>> questions you want to answer, but you haven't taken the bait yet.  A
>>> statistical analysis is always designed to test certain scientific
>>> questions -- there isn't a "correct" analysis for a given design
>>> independent of what your hypotheses are.
>>>
>>> Have you looked at Section 3.5 "Comparisons Both Between and Within
>>> Subjects" in the edgeR User's Guide?  The design discussed in this section
>>> is the same as your experiment, except that you have 3 repeated measures
>>> per subject instead of 2.
>>>
>>> The analysis given in the edgeR user's guide allows you to find genes
>>> that are different over time for (i) treated subjects and (ii) control
>>> subjects, and it allows you to find genes that respond differently to time
>>> in the treated vs control subject.
>>>
>>> However it does not allow you to test for a baseline difference between
>>> treated and control subjects at time 0.  If you need to do this, then a
>>> quite different analysis is needed (discussed in Section 9.7 "Multi-level
>>> Experiments" of the limma User's Guide).
>>>
>>> Best wishes
>>> Gordon
>>>
>>>
>>> On Mon, 1 Jul 2013, James W. MacDonald wrote:
>>>
>>>  Hi Charles,
>>>>
>>>> On 7/1/2013 9:07 AM, Charles Determan Jr wrote:
>>>>
>>>
>>>  I apologize for a second post but I want to bring this question back up
>>>>> as I still cannot find a definitive answer on my own.  In brief, I am
>>>>> wondering about the design matrix when testing for differential expression
>>>>> between two groups within which each sample has been measured at
>>>>> consecutive timepoints (repeated measures).  Therefore, if my
>>>>> interpretations are correct, I need a two-way analysis that recognizes
>>>>> dependence between consecutive measurements.  I am familiar with limma,
>>>>> edgeR and DESeq but am uncertain how to design an appropriate design matrix
>>>>> for these comparisons.  The best I can guess is that I add a 'Subject'
>>>>> factor to the design matrix corresponding to each unique sample to correct
>>>>> for dependence, is this correct?
>>>>>
>>>>
>>>> It depends on how sophisticated you want to get, or alternatively what
>>>> assumptions you are willing to make.
>>>>
>>>> The simplest thing to do would be to block on subject (see the blocking
>>>> portion of the limma User's guide, starting on p. 42). This makes very
>>>> simple assumptions about the data, namely that the differences between
>>>> subjects can be accounted for by the mean of each subject.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>> My sincere regards,
>>>>> Charles
>>>>>
>>>>>
>>>>> On Wed, Jun 26, 2013 at 11:54 AM, Charles Determan Jr <deter088 at umn.edu> wrote:
>>>>>
>>>>>  To help clarify further here is a dataframe of the design.
>>>>>>
>>>>>>     subject  group times
>>>>>> 1        1 Treated    0hr
>>>>>> 2        2 Treated    0hr
>>>>>> 3        3 Control    0hr
>>>>>> 4        4 Treated    0hr
>>>>>> 5        5 Control    0hr
>>>>>> 6        6 Control    0hr
>>>>>> 7        1 Treated    1hr
>>>>>> 8        2 Treated    1hr
>>>>>> 9        3 Control    1hr
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> 17       5 Control    2hr
>>>>>> 18       6 Control    2hr
>>>>>>
>>>>>> My thought process has been as follows:
>>>>>>
>>>>>> In the edgeR userguide there is the treatment combination example
>>>>>>
>>>>>> > targets
>>>>>>     Sample   Treat Time
>>>>>> 1  Sample1 Placebo   0h
>>>>>> 2  Sample2 Placebo   0h
>>>>>> 3  Sample3 Placebo   1h
>>>>>> 4  Sample4 Placebo   1h
>>>>>> 5  Sample5 Placebo   2h
>>>>>> 6  Sample6 Placebo   2h
>>>>>> 7  Sample1    Drug   0h
>>>>>> 8  Sample2    Drug   0h
>>>>>> 9  Sample3    Drug   1h
>>>>>> 10 Sample4    Drug   1h
>>>>>> 11 Sample5    Drug   2h
>>>>>> 12 Sample6    Drug   2h
>>>>>>
>>>>>> which combines the groups to produce a single group (ex. Drug.1,
>>>>>> Placebo.1, Drug.2, etc)
>>>>>>
>>>>>> This seems potentially appropriate but this appears to assume
>>>>>> independence between samples whereas my data consists of what you could
>>>>>> call 'true repeated measures' on the same sample.  This seems to draw on
>>>>>> the paired samples and blocked examples.  These proceed by having the
>>>>>> 'subject' as a factor as well, for example:
>>>>>>
>>>>>> design <- model.matrix(~Subject+Treatment)
>>>>>>
>>>>>> This leads me to guess that a combination of these techniques is
>>>>>> required. Perhaps merging the times and group factors in my dataset (see
>>>>>> above) as 'newgroup' (e.g. Control.0, Control.1, Treatment.0, etc).  Then
>>>>>> create the model formula:
>>>>>>
>>>>>> design <- model.matrix(~Subject+newgroup)
>>>>>>
>>>>>> Does this seem appropriate or am I way off base and over thinking
>>>>>> this? Thanks for any suggestions.
>>>>>>
>>>>>> Regards,
>>>>>> Charles
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 25, 2013 at 11:11 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>>>>
>>>>>>  Charles,
>>>>>>>
>>>>>>> Are there only 2 biological units in your experiment?  (One for treatment
>>>>>>> and one for control?)  Or do you have multiple biological units in each
>>>>>>> group?  Surely it must be the latter but, if so, your model does not take
>>>>>>> this into account.
>>>>>>>
>>>>>>> What questions do you want to test?
>>>>>>>
>>>>>>> Best
>>>>>>> Gordon
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 25 Jun 2013, Charles Determan Jr wrote:
>>>>>>>
>>>>>>>   Gordon,
>>>>>>>
>>>>>>>> I apologize for not being more definitive with my description. Your
>>>>>>>> initial definition is my intention, consecutive measurements on the same
>>>>>>>> biological units.  I will look over the comments in the link you provided.
>>>>>>>> Thank you for your insight, I appreciate any further thoughts you may have.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Charles
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 25, 2013 at 6:57 PM, Gordon K Smyth<smyth at wehi.edu.au>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>   Dear Charles,
>>>>>>>>
>>>>>>>>> The term "repeated measures" describes a situation in which repeated
>>>>>>>>> measurements are made on the same biological unit.  Hence the repeated
>>>>>>>>> measurements are correlated.  It is not clear from the brief information
>>>>>>>>> you give whether this is the case, or whether the different time points
>>>>>>>>> derive from independent biological samples.
>>>>>>>>>
>>>>>>>>> The model you give might or might not be correct, depending on the
>>>>>>>>> experimental units and the hypotheses that you plan to test.  For most
>>>>>>>>> experiments it is not the right approach, for reasons that I have pointed
>>>>>>>>> out elsewhere:
>>>>>>>>>
>>>>>>>>> https://www.stat.math.ethz.ch/pipermail/bioconductor/2013-June/053297.html
>>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>> Gordon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   Date: Mon, 24 Jun 2013 15:08:48 -0500
>>>>>>>>>
>>>>>>>>>  From: Charles Determan Jr<deter088 at umn.edu>
>>>>>>>>>> To: bioconductor at r-project.org
>>>>>>>>>> Subject: [BioC] Repeated Measures mRNA expression analysis
>>>>>>>>>>
>>>>>>>>>> Greetings,
>>>>>>>>>>
>>>>>>>>>> I need to analyze data collected from an RNA-seq experiment. This
>>>>>>>>>> consists of comparing two groups (control vs. treatment) and repeated
>>>>>>>>>> sampling (1 hour, 2 hours, 3 hours).  If this were a univariate problem I
>>>>>>>>>> know I would use a 2-way rmANOVA analysis but this is RNA-seq and I have
>>>>>>>>>> thousands of variables.  I am very familiar with multiple packages for RNA
>>>>>>>>>> differential expression analysis (e.g. DESeq2, edgeR, limma, etc.) but I
>>>>>>>>>> have been unable to figure out the most appropriate way to analyze
>>>>>>>>>> such data in this circumstance. The closest answer I can find within the
>>>>>>>>>> DESeq2 and edgeR manuals (limma is somewhat confusing to me) is to place the
>>>>>>>>>> main treatment of interest at the end of the design formula, for example:
>>>>>>>>>>
>>>>>>>>>> design(dds)<- formula(~ time + treatment)
>>>>>>>>>>
>>>>>>>>>> Is this what is considered the appropriate way to address repeated
>>>>>>>>>> measures in mRNA expression experiments?  Any thoughts are appreciated.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Charles Determan
>>>>>>>>>> Integrated Biosciences PhD Candidate
>>>>>>>>>> University of Minnesota
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  --
>>>>>>>> Charles Determan
>>>>>>>> Integrated Biosciences PhD Candidate
>>>>>>>> University of Minnesota
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
