[BioC] Multifactorial edgeR GLM design question (contrast that I should make)

Wed Jan 29 19:33:12 CET 2014

Hi Zhihao,

On 1/29/2014 12:07 PM, Zhihao Tan wrote:
> Hi Jim,
>
> Thanks for your reply. Alright, I'm guessing that contrast really 
> doesn't make sense, and actually going back to look at the additional 
> genes that the (Day2.Fluffy - Day2.Smooth) + (Day5.Fluffy - 
> Day5.Smooth) brought up, it seems like it might mostly be noise, or 
> day specific effects.
>
> (Day2.Fluffy - Day2.Smooth) - (Day5.Fluffy - Day5.Smooth) would indeed 
> be an interesting set of genes to look at, and I will definitely try that.
>
> I do want to try and understand the best (whatever best means...) way 
> to get at the genes that are DE between fluffy and smooth though. I 
> can briefly think of three ways:
> - Lump all the samples, build a model with 1 effect (Phenotype), and 
> get a set of DE genes (Fluffy - Smooth). But that seems to be 
> averaging out the values, and not supplying information to the model 
> about a source of biological variation (Time) that we know about.
> - Lump all the samples, build a model with 2 effects (Phenotype and 
> Time), test (Day2.Fluffy - Day2.Smooth) and (Day5.Fluffy - 
> Day5.Smooth), get the intersection of the DE genes. This should 
> account for biological variation, though normalization of Day2 and 
> Day5 samples together would add a little noise?
> - Separate samples by day, test (Fluffy - Smooth) for each, get the 
> intersect. And only when I want to test for interaction between 
> effects do I build the 2 effect interaction model. This to me seems 
> the cleanest, but I'm not sure if that makes sense in the world of 
> biostatistics...

The conventional way to do this would be to fit a model like

design <- model.matrix(~phenotype*time)

where phenotype is a factor with two levels (Fluffy and Smooth, with 
Fluffy as the reference level) and time is a factor with two levels 
(2 and 5, with 2 as the reference level). This will result in a design 
matrix like this:

 > model.matrix(~phenotype*time)
   (Intercept) phenotypeSmooth time5 phenotypeSmooth:time5
1           1               0     0                     0
2           1               0     0                     0

Here phenotypeSmooth is inherently a contrast comparing Smooth - Fluffy 
at the baseline time (day 2), time5 is a contrast of day 5 - day 2 in the 
baseline phenotype (Fluffy), and phenotypeSmooth:time5 tests the 
interaction, i.e. whether the Smooth - Fluffy difference changes between 
day 2 and day 5.
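
To make the coefficient definitions concrete, here is a minimal sketch in 
base R. The sample layout (two replicates per phenotype/day combination) is 
hypothetical, not the real experiment, and it assumes R's default treatment 
contrasts. It expresses each coefficient of the interaction model as a 
combination of the four group means (Day2.Fluffy, Day2.Smooth, Day5.Fluffy, 
Day5.Smooth):

```r
# Hypothetical balanced layout: 2 phenotypes x 2 days x 2 replicates.
phenotype <- factor(rep(c("Fluffy", "Smooth"), times = 4),
                    levels = c("Fluffy", "Smooth"))
time <- factor(rep(c(2, 5), each = 4), levels = c(2, 5))

# Interaction model, using R's default treatment contrasts.
design <- model.matrix(~ phenotype * time)

# One design row per phenotype/day cell, labelled like the group-means design.
cells <- unique(data.frame(phenotype, time))
X <- model.matrix(~ phenotype * time, data = cells)
rownames(X) <- paste0("Day", cells$time, ".", cells$phenotype)

# Group means = X %*% coefficients, so inverting X writes each
# coefficient as a weighted sum of the group means.
coef_as_means <- round(solve(X))
dimnames(coef_as_means) <- list(colnames(X), rownames(X))
print(coef_as_means)
# phenotypeSmooth       = Day2.Smooth - Day2.Fluffy
# time5                 = Day5.Fluffy - Day2.Fluffy
# phenotypeSmooth:time5 = (Day5.Smooth - Day5.Fluffy) - (Day2.Smooth - Day2.Fluffy)
```

The last row is the same interaction contrast as (Day2.Fluffy - Day2.Smooth) 
- (Day5.Fluffy - Day5.Smooth) in the group-means design: the two sign flips 
cancel, so the coefficients are identical.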

Note that the second and third coefficients cannot be interpreted as 
overall phenotype or time effects for any gene that has a significant 
interaction, so you should first look for genes with an interaction, and 
then look for genes that differ in Smooth - Fluffy only among the genes 
that do not have a significant interaction term.
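
A rough sketch of that two-step procedure in edgeR might look like the 
following. Everything about the data here is an assumption for illustration: 
the counts are simulated noise, the layout (three replicates per 
phenotype/day combination) is hypothetical, and the 0.05 FDR cutoff is just 
a common default, so treat it as an outline rather than a recipe.

```r
library(edgeR)

set.seed(1)
# Hypothetical layout: 2 phenotypes x 2 days x 3 replicates = 12 samples.
phenotype <- factor(rep(c("Fluffy", "Smooth"), times = 6),
                    levels = c("Fluffy", "Smooth"))
time <- factor(rep(c(2, 5), each = 6), levels = c(2, 5))

# Simulated counts standing in for the real data (no true DE genes).
counts <- matrix(rnbinom(1000 * 12, mu = 50, size = 10), nrow = 1000,
                 dimnames = list(paste0("gene", 1:1000), NULL))

design <- model.matrix(~ phenotype * time)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

# Step 1: flag genes with a significant phenotype-by-time interaction.
lrt_int <- glmLRT(fit, coef = "phenotypeSmooth:time5")
int_tab <- topTags(lrt_int, n = Inf, sort.by = "none")$table
has_interaction <- int_tab$FDR < 0.05

# Step 2: test the phenotype coefficient, and keep only genes
# that were not flagged in step 1.
lrt_pheno <- glmLRT(fit, coef = "phenotypeSmooth")
pheno_tab <- topTags(lrt_pheno, n = Inf, sort.by = "none")$table
pheno_de <- rownames(pheno_tab)[pheno_tab$FDR < 0.05 & !has_interaction]
```

Using sort.by = "none" keeps both result tables in the original gene order, 
so the logical vector from step 1 lines up with the table from step 2.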

Does that make sense?

Best,

Jim

>
> Hope to get your advice on this... and thanks once again!
>
> Cheers,
> Zhihao
>
>
>
>
> On Wed, Jan 29, 2014 at 6:17 AM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
>     Hi Zhihao,
>
>
>     On Tuesday, January 28, 2014 6:56:20 PM, Zhihao Tan wrote:
>
>         Hi there,
>
>         I have a question on whether some of the contrasts I am making in a
>         multifactorial experiment should actually be made. I don't have a strong
>         grasp of GLMs, so I might be missing something conceptually, and am hoping
>         someone can advise.
>
>         I am basically looking for genes that are differentially expressed in a
>         certain phenotypic state (e.g. fluffy vs. smooth), but have set it up with
>         2 time-points (Day 2 and Day 5). I have trouble setting up the design using
>         an equation (columns seem to disappear) so have gone ahead and created the
>         design matrix using the method in 3.3.1 of the manual (pasting factors
>         together). The design looks like this (I have removed replicates and many
>         samples to simplify):
>
>             Day2.Fluffy Day2.Smooth Day5.Fluffy Day5.Smooth
>         1            0           1           0           0
>         7            1           0           0           0
>         13           0           0           0           1
>         16           0           0           1           0
>         19           0           0           0           1
>         35           0           0           1           0
>         36           0           0           1           0
>
>         From what I understand, the above design is set up for 2 main effects
>         (phenotype and time), and if I reduce it to 1 main effect (phenotype), I
>         get the design below.
>
>             Fluffy Smooth
>         1       0      1
>         7       1      0
>         13      0      1
>         16      1      0
>         19      0      1
>         35      1      0
>         36      1      0
>
>         The contrast I make in the latter case is basically (Fluffy - Smooth). The
>         contrast that I did for the former case, and this is what I'm unsure of, is
>         ((Day2.Fluffy - Day2.Smooth) + (Day5.Fluffy - Day5.Smooth)). These tests
>         are definitely not equivalent, and I get different number of sig. DE genes
>         for both (more for the 2 effect design). In my mind, it makes sense,
>         because the experiment *is* set up with 2 effects, and accounting for the
>         biological variation in your model should allow you to be more powered to
>         detect DE genes. However, I've never seen a contrast like that before. Does
>         it even make sense to have an addition sign in the equation? What does that
>         actually mean? Should I instead make contrasts of (Day2.Fluffy -
>         Day2.Smooth) and (Day5.Fluffy - Day5.Smooth) and get the union or intersect
>         of them?
>
>
>     The contrast you are using doesn't really make sense, because a
>     contrast is usually testing the difference between groups, so you
>     subtract rather than sum. If you were to use
>
>     (Day2.Fluffy - Day2.Smooth) - (Day5.Fluffy - Day5.Smooth)
>
>     then you would be testing the interaction of time and phenotype.
>     In other words the interaction looks for genes that are different
>     between fluffy and smooth, depending on the day. So if you think
>     the fluffiness of your samples is dependent on time, that is what
>     you would likely want to test.
>
>     Best,
>
>     Jim
>
>
>
>         Hope someone can help on this, and thanks in advance!
>
>         Regards,
>         Zhihao
>         Graduate Student
>         University of Washington
>
>
>     --
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099