[R-sig-ME] Help with lme modelling

Wed Apr 1 22:24:50 CEST 2009

Dear Sanja,

Let's temporarily put aside the issue pointed by Luca Borger (see at
end).

Your (first) problem is that anova.lme gives you by default *sequential*
F-tests, i.e. the "(Intercept)" lines tests the "model enhancement" of
the "X=someconstant" model (n° 1) over the "X=0" model (n° 0), the
second line "treatment" tests the "X=someconstant
+someotherconstant*treatment" model (n° 2) against the previous one (n°
1), the "age" lines tests the "X=someconstant
+someotherconstant*treatment+yetanotherconstant*age" model (n° 3)
against n° 2 and the "age*treatment" line tests the "X=someconstant
+someotherconstant*treatment+yetanotherconstant*age
+onemoreonstant*treatment*age" model (n° 4) against n° 3.

What you probably expected was probably what was called an "Analysis of
variance table" in the "classical" sense, where each factor is tested
against the full model, i. e. testing model n° 4 minus the relevant
factor vs model n° 4 in full. anova.lme allows that :

> anova(lme(expr.val~treatment*age,data=Chapuis,random=~1|
lab),type="marginal")
              numDF denDF   F-value p-value
(Intercept)       1    13 211.58292  <.0001
treatment         1    13   0.01050  0.9199
age               1    13  18.09537  0.0009
treatment:age     1    13  21.53626  0.0005

However, there are some serious problems with this approach. You should
use RSiteSearch to find Bill Venable's reflections on linear regression
(I keep forgetting the exact title and address of this paper. Shame on
me....). I won't paraphrase it, but just look :
> options("contrasts")
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly" 
# This is R default...

> anova(lme(expr.val~treatment*age,data=Chapuis,random=~1|
lab),type="marginal")
              numDF denDF   F-value p-value
(Intercept)       1    13 211.58292  <.0001
treatment         1    13   0.01050  0.9199
age               1    13  18.09537  0.0009
treatment:age     1    13  21.53626  0.0005
# We already saw that...

# Let's try to get the (in)famous so called "Type III sum of squares" so
# popularized by SAS
> options(contrasts=c("contr.helmert","contr.poly"))
# See Bill Venable's paper and/or V&R for explanations. This works...
# BUT :
> anova(lme(expr.val~treatment*age,data=Chapuis,random=~1|
lab),type="marginal")
              numDF denDF   F-value p-value
(Intercept)       1    13 219.17719  <.0001
treatment         1    13  20.45202  0.0006
age               1    13   2.51384  0.1369
treatment:age     1    13  21.53626  0.0005

Ouch : quite different results !

The point is : when you have an unbalanced design, the results
(estimates and F-values) for the main effects in an analysis including
interactions depend on the way your factors are (re-)coded. The
so-called "type III SS" attempt to give answers for an (imaginary)
population as unbalanced as your sample.

V&R argue (strongly, and rightly, IMHO) against the use of this
computational artifact, dating back to the time when such analyses were
batched on expensive computers.

They also argue (even more importantly) that the demonstration of the
existence of an interaction effect renders meaningless the question of
the research of "a global effect" for the main factors. They state that
they "have not het seen a situation where this made statistical sense".
THIS IS THE ROOT OF YOUR PROBLEMS.

What you can do is to search for an effect of one factor in each of the
subsamples defined by the second factor :

> options(contrasts=c("contr.treatment","contr.poly"))
> anova(lme(expr.val~treatment,data=Rogic,subset=age=="young",random=~1|
lab),type="marginal")
            numDF denDF  F-value p-value
(Intercept)     1     4 27.49423  0.0063
treatment       1     4 15.38910  0.0172
> anova(lme(expr.val~treatment,data=Rogic,subset=age=="old",random=~1|
lab),type="marginal")
            numDF denDF   F-value p-value
(Intercept)     1     9 221.97631  <.0001
treatment       1     9   0.01962  0.8917

# i. e. treatment effect can be demonstrated only in young
# Conversively :

> anova(lme(expr.val~age,data=Rogic,subset=treatment=="sal",random=~1|
lab),type="marginal")
            numDF denDF   F-value p-value
(Intercept)     1     6 119.84139  <.0001
age             1     6   3.26905  0.1206
> anova(lme(expr.val~age,data=Rogic,subset=treatment=="KA",random=~1|
lab),type="marginal")
            numDF denDF  F-value p-value
(Intercept)     1     6 642.5785  <.0001
age             1     6  36.7017   9e-04

# i. e. age effect can be demonstrated only when treatment is KA.

Note : I jumped directly to the ANOVA tables. This is bad : in your
case, the center "B" has only old subjects, and the second analysis
should have spit out a warning...

Note 2 : your initial analysis using default contrasts used "old" and
"KA" as base levels for their respective factors. You saw that the
treatment effect cannot be demonstrated at the base level for age (old,
in your case), that an age effect exists in subjects receiving the base
treatment (KA in your case), and that an interaction destroyed these
conclusions for the other levels. Same (qualitative) conclusions. The
p-value differ because (among other) the number of subjects used to
assess the variance differ.

Now for the number of levels of the random effect. If your experiment is
balanced, lme will go ahead and give you what a classical "mixed-effect
analysis of variance" (that you can get with aov()) gives. However, your
sampling plan is wildly unbalanced, and inter-group variance assessment
may be poor. I defer to more knowledgeable people's judgement, but I'd
tend to worry only if your "real" data were strongly unbalanced.

Hope this helps,

					Emmanuel Charpentier

Le mercredi 01 avril 2009 à 11:44 -0700, Sanja Rogic a écrit :
> I would like to apologize in advance if this is not the right forum
> for posting this message, since I am using nlme instead of lme4
> package, but it seamed to me that this is the best place for Mixed
> Effects Models (MEM) questions.
> 
> I've recently started using MEM for the differential expression
> analysis of genes in cases where I want to analyze compatible datasets
> from different experiments/labs (modelling labs as random effects). I
> was pretty happy with the results since MEM allows for direct use of
> raw expression data, unlike some other meta-analysis approaches and
> thus seem to be more sensitive.
> 
> However, I discovered several cases where MEM failed to make correct
> predictions and I would like to find a way to automatically detect
> these in the future (I am doing the analysis on thousands of genes so
> it is not feasible to graphically inspect the quality of a model fit
> for each one separately).
> 
> Here is an example:
> 
> #data frame
> 
> > data
>           expr.val   lab treatment   age
> X1        4.751217     A       sal   old
> X2        4.660610     A       sal   old
> X3        4.810334     A       sal   old
> X4        5.235475     A        KA   old
> X5        5.414665     A        KA   old
> X6        4.451858     A        KA   old
> X7        4.812522     A       sal young
> X8        6.346419     A       sal young
> X9        5.267596     A       sal young
> X10       3.823031     A        KA young
> X11       3.619814     A        KA young
> X12       3.539568     A        KA young
> X13       5.671991     B       sal   old
> X14       5.715370     B       sal   old
> X15       5.694324     B       sal   old
> X16       5.457560     B        KA   old
> X17       5.454112     B        KA   old
> X18       5.430781     B        KA   old
> 
> As you can see the data comes from two labs, A and B, there are two
> treatments, KA and sal, and the effect of these is my main interest
> and there are also two age groups.
> 
> Fitting linear MEM (I expect some interaction between two fixed
> effects, treatment and age):
> 
> > data.lme<- lme(expr.val~treatment*age,data=data,random=~1|lab)
> 
> > summary(data.lme)
> Linear mixed-effects model fit by REML
>  Data: m
>        AIC      BIC    logLik
>   33.76913 37.60347 -10.88456
> 
> Random effects:
>  Formula: ~1 | Study
>         (Intercept)  Residual
> StdDev:   0.4553264 0.3960901
> 
> Fixed effects: E ~ Treatment * Age
>                           Value Std.Error DF   t-value p-value
> (Intercept)            5.240742 0.3602901 13 14.545894  0.0000
> Treatmentsal          -0.023434 0.2286827 13 -0.102475  0.9199
> Ageyoung              -1.276539 0.3000890 13 -4.253867  0.0009
> Treatmentsal:Ageyoung  1.838143 0.3960901 13  4.640719  0.0005
>  Correlation:
>                       (Intr) Trtmnt Ageyng
> Treatmentsal          -0.317
> Ageyoung              -0.242  0.381
> Treatmentsal:Ageyoung  0.183 -0.577 -0.660
> 
> Standardized Within-Group Residuals:
>        Min         Q1        Med         Q3        Max
> -1.6738372 -0.3845786 -0.2229425  0.4311380  2.1987599
> 
> Number of Observations: 18
> Number of Groups: 2
> 
> > anova(data.lme)
>               numDF denDF   F-value p-value
> (Intercept)       1    13 232.76945  <.0001
> Treatment         1    13   9.96020  0.0076
> Age               1    13   2.51385  0.1369
> Treatment:Age     1    13  21.53627  0.0005
> 
> 
> 
> 
> What I do when automatically analyzing all the genes is to look at the
> p-value from anova function to assess the significance of Treatment
> effect and at the sign of estimated parameter (Treatmentsal) from the
> summary function to decide on the direction of change (is it more
> expressed after sal or KA treatment). In the above example the results
> indicate that Treatment has significant effect and that there is more
> expression for KA effect (Treatmentsal=-0.023434). However, when I
> plot the raw data and superimpose the fitted values it is obvious that
> the fit is not good for KA-old datapoints and this is probably caused
> by contradictory data wrt to treatment effect between the two labs
> (sal<KA for A and sal>KA for B).
> 
> My questions are:
> 
> 1. How is the F-value in  anova function computed? I would like to
> understand why I got such a small value when data is contradictory.
> 
> 2. The Treatmentsal value seems to correspond only to the age=old
> (change from a baseline treatment=KA, age=old). How do I compute the
> Treatmentsal value for both age groups?
> 
> 3. In general, what functions/parameters can I used in automated way
> to decide if there is a significant effect of, for example, treatment
> KA?
> 
> Sorry for the lengthy post and superficial understanding of how MEM
> work (not much background in statistics). I would very much appreciate
> any help on these issues.
> 
> Cheers,
> Sanja Rogic
>