[BioC] Multifactorial edgeR GLM design question (contrast that I should make)
James W. MacDonald
jmacdon at uw.edu
Wed Jan 29 19:33:12 CET 2014
Hi Zhihao,
On 1/29/2014 12:07 PM, Zhihao Tan wrote:
> Hi Jim,
>
> Thanks for your reply. Alright, I'm guessing that contrast really
> doesn't make sense, and actually going back to look at the additional
> genes that the (Day2.Fluffy - Day2.Smooth) + (Day5.Fluffy -
> Day5.Smooth) brought up, it seems like it might mostly be noise, or
> day specific effects.
>
> (Day2.Fluffy - Day2.Smooth) - (Day5.Fluffy - Day5.Smooth) would indeed
> be an interesting set of genes to look at, and I will definitely try that.
>
> I do want to try and understand the best (whatever best means...) way
> to get at the genes that are DE between fluffy and smooth though. I
> can briefly think of three ways:
> - Lump all the samples, build a model with 1 effect (Phenotype), and
> get a set of DE genes (Fluffy - Smooth). But that seems to be
> averaging out the values, and not supplying information to the model
> about a source of biological variation (Time) that we know about.
> - Lump all the samples, build a model with 2 effects (Phenotype and
> Time), test (Day2.Fluffy - Day2.Smooth) and (Day5.Fluffy -
> Day5.Smooth), get the intersection of the DE genes. This should
> account for biological variation, though normalization of Day2 and
> Day5 samples together would add a little noise?
> - Separate samples by day, test (Fluffy - Smooth) for each, get the
> intersect. And only when I want to test for interaction between
> effects do I build the 2 effect interaction model. This to me seems
> the cleanest, but I'm not sure if that makes sense in the world of
> biostatistics...
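To make the contrasts discussed in this thread concrete, here is a minimal base R sketch. It writes each contrast as a weight vector over the four group means of the pasted-factor design and applies it to made-up log-scale means. The numbers in `mu` are purely illustrative assumptions, not data from this experiment.

```r
# Hypothetical group means on the log scale (illustrative values only).
mu <- c(Day2.Fluffy = 5.0, Day2.Smooth = 4.0,
        Day5.Fluffy = 6.5, Day5.Smooth = 4.5)

# Contrast weights, in the same column order as the group-mean design.
contrasts <- rbind(
  day2        = c(1.0, -1.0,  0.0,  0.0),  # Day2.Fluffy - Day2.Smooth
  day5        = c(0.0,  0.0,  1.0, -1.0),  # Day5.Fluffy - Day5.Smooth
  sum         = c(1.0, -1.0,  1.0, -1.0),  # day2 + day5
  average     = c(0.5, -0.5,  0.5, -0.5),  # (day2 + day5) / 2
  interaction = c(1.0, -1.0, -1.0,  1.0)   # day2 - day5
)
colnames(contrasts) <- names(mu)

# Value of each contrast for the hypothetical means.
estimates <- drop(contrasts %*% mu)
print(estimates)
```

With these made-up means the summed contrast is exactly twice the averaged one, while the difference of the two day-specific contrasts is the interaction. In edgeR, a weight vector of this kind would be passed through the `contrast` argument of `glmLRT()`.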
The conventional way to do this would be to fit a model like
design <- model.matrix(~phenotype*time)
where phenotype is a factor with the two levels (fluffy and smooth) and
time is a factor with the two levels (2 and 5). This will result in a
design matrix like this:
> model.matrix(~phenotype*time)
  (Intercept) phenotypeSmooth time5 phenotypeSmooth:time5
1           1               0     0                     0
2           1               0     0                     0
Where phenotypeSmooth is inherently a contrast comparing Smooth - Fluffy
at the reference level of time (day 2), time5 is a contrast of day5 - day2
at the reference level of phenotype (Fluffy), and phenotypeSmooth:time5
tests the interaction.
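As a rough check on what each column means, the following base R sketch builds the design for a hypothetical balanced layout (two replicates per group) and recovers the coefficients from noise-free, made-up group means. The sample sheet and the values in `cell_means` are assumptions for illustration only, and the result relies on R's default treatment coding with Fluffy and day 2 as reference levels.

```r
# Hypothetical sample sheet: 2 phenotypes x 2 time points, 2 replicates each.
samples <- expand.grid(rep = 1:2,
                       phenotype = c("Fluffy", "Smooth"),
                       time = c("2", "5"))
samples$phenotype <- factor(samples$phenotype, levels = c("Fluffy", "Smooth"))
samples$time <- factor(samples$time, levels = c("2", "5"))

design <- model.matrix(~ phenotype * time, data = samples)
print(design)

# Made-up group means on the log scale (illustrative values only).
cell_means <- c(Fluffy.2 = 5.0, Smooth.2 = 4.0, Fluffy.5 = 6.5, Smooth.5 = 4.5)
mu <- unname(cell_means[paste(samples$phenotype, samples$time, sep = ".")])

# Least-squares fit to noise-free data returns the coefficients exactly.
beta <- coef(lm.fit(design, mu))
print(beta)
```

For these made-up means the intercept equals the Fluffy day 2 mean, phenotypeSmooth equals Smooth - Fluffy at day 2, time5 equals day 5 - day 2 within Fluffy, and the interaction equals the change in the Smooth - Fluffy difference between day 2 and day 5.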
Note that the second and third coefficients cannot be interpreted as
overall phenotype or time effects for any genes that have a significant
interaction, so you should first look for genes with an interaction, and
then look for genes that are different in Smooth - Fluffy only in the set
of genes that do not have a significant interaction term.
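One possible way to script that two-step procedure with edgeR is sketched below. The helper `two_step_de()`, its `fdr` argument, and the simulated counts are hypothetical and only for illustration. The sketch assumes the design columns are ordered as in the matrix above (phenotype in column 2, interaction in column 4) and uses Benjamini-Hochberg adjustment as one reasonable choice of threshold, not the only one.

```r
# Sketch only: needs the edgeR package. `counts` is a genes x samples matrix
# with row names, `design` comes from model.matrix(~ phenotype * time).
two_step_de <- function(counts, design, fdr = 0.05) {
  y <- edgeR::DGEList(counts = counts)
  y <- edgeR::calcNormFactors(y)
  y <- edgeR::estimateDisp(y, design)
  fit <- edgeR::glmFit(y, design)

  # Step 1: test the interaction term (4th design column).
  lrt_int <- edgeR::glmLRT(fit, coef = 4)
  has_interaction <- p.adjust(lrt_int$table$PValue, method = "BH") < fdr

  # Step 2: test the phenotype coefficient (2nd design column), adjusting
  # only over the genes without a significant interaction.
  lrt_pheno <- edgeR::glmLRT(fit, coef = 2)
  pheno_fdr <- rep(NA_real_, nrow(counts))
  pheno_fdr[!has_interaction] <-
    p.adjust(lrt_pheno$table$PValue[!has_interaction], method = "BH")

  data.frame(gene = rownames(counts),
             has_interaction = has_interaction,
             pheno_fdr = pheno_fdr,
             pheno_de = !has_interaction & pheno_fdr < fdr)
}

# Usage on simulated counts with no true effects (skipped without edgeR).
if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  phenotype <- factor(rep(c("Fluffy", "Smooth"), each = 3, times = 2))
  time <- factor(rep(c("2", "5"), each = 6))
  design <- model.matrix(~ phenotype * time)
  counts <- matrix(rnbinom(300 * 12, mu = 100, size = 10), nrow = 300,
                   dimnames = list(paste0("gene", 1:300),
                                   paste0("sample", 1:12)))
  res <- two_step_de(counts, design)
  print(table(res$pheno_de))
}
```

Genes flagged in step 1 are left out of the phenotype call entirely, so `pheno_de` is always FALSE for them and their `pheno_fdr` is NA.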
Does that make sense?
Best,
Jim
>
> Hope to get your advice on this... and thanks once again!
>
> Cheers,
> Zhihao
>
>
>
>
> On Wed, Jan 29, 2014 at 6:17 AM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
> Hi Zhihao,
>
>
> On Tuesday, January 28, 2014 6:56:20 PM, Zhihao Tan wrote:
>
> Hi there,
>
> I have a question on whether some of the contrasts I am making in a
> multifactorial experiment should actually be made. I don't have a strong
> grasp of GLMs, so I might be missing something conceptually, and am hoping
> someone can advise.
>
> I am basically looking for genes that are differentially expressed in a
> certain phenotypic state (e.g. fluffy vs. smooth), but have set it up with
> 2 time-points (Day 2 and Day 5). I have trouble setting up the design using
> an equation (columns seem to disappear) so have gone ahead and created the
> design matrix using the method in 3.3.1 of the manual (pasting factors
> together). The design looks like this (I have removed replicates and many
> samples to simplify):
>
>    Day2.Fluffy Day2.Smooth Day5.Fluffy Day5.Smooth
> 1            0           1           0           0
> 7            1           0           0           0
> 13           0           0           0           1
> 16           0           0           1           0
> 19           0           0           0           1
> 35           0           0           1           0
> 36           0           0           1           0
>
> From what I understand, the above design is set up for 2 main effects
> (phenotype and time), and if I reduce it to 1 main effect (phenotype), I
> get the design below.
>
>    Fluffy Smooth
> 1       0      1
> 7       1      0
> 13      0      1
> 16      1      0
> 19      0      1
> 35      1      0
> 36      1      0
>
> The contrast I make in the latter case is basically (Fluffy - Smooth). The
> contrast that I did for the former case, and this is what I'm unsure of, is
> ((Day2.Fluffy - Day2.Smooth) + (Day5.Fluffy - Day5.Smooth)). These tests
> are definitely not equivalent, and I get a different number of sig. DE genes
> for each (more for the 2 effect design). In my mind, it makes sense,
> because the experiment *is* set up with 2 effects, and accounting for the
> biological variation in your model should allow you to be more powered to
> detect DE genes. However, I've never seen a contrast like that before. Does
> it even make sense to have an addition sign in the equation? What does that
> actually mean? Should I instead make contrasts of (Day2.Fluffy -
> Day2.Smooth) and (Day5.Fluffy - Day5.Smooth) and get the union or intersect
> of them?
>
>
> The contrast you are using doesn't really make sense, because a
> contrast is usually testing the difference between groups, so you
> subtract rather than sum. If you were to use
>
> (Day2.Fluffy - Day2.Smooth) - (Day5.Fluffy - Day5.Smooth)
>
> then you would be testing the interaction of time and phenotype.
> In other words the interaction looks for genes that are different
> between fluffy and smooth, depending on the day. So if you think
> the fluffiness of your samples is dependent on time, that is what
> you would likely want to test.
>
> Best,
>
> Jim
>
>
>
> Hope someone can help on this, and thanks in advance!
>
> Regards,
> Zhihao
> Graduate Student
> University of Washington
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>