[R-sig-ME] Nested Mixed Models in lme4

Thu Nov 8 18:40:42 CET 2007

On 11/8/07, Marco Chiarandini <marco at imada.sdu.dk> wrote:
> Dear Prof. Bates,

> I am trying to use the function lmer from lme4 to
> analyse the following nested factorial design.

> I have three treatment factors (neighborhood,
> initial, k);
> I have three group factors crossing (size, dens,
> inst).

Did you mean to write (size, dens, type) there?

Also, by "factor" do you mean that you regard all of these variables
as categorical?  If so, you should check the form of the size variable
in the data frame.  It is being stored as a numeric variable, not as a
factor.  If you want to interpret this  variable as a categorical
factor you should convert it to a factor or, as seems likely in this
case, an ordered factor.  (See ?factor and ?ordered)

> I have one random factor (Subject or, in my case,
> inst) nested within the groups generated by the
> crossing of the group factors.

Nesting is automatically handled appropriately in lmer as long as the
levels of the inst factor are distinct. That is, if each distinct
level of the inst factor has a distinct label, which also appears
likely in this case because I see the first label contains G and 200
and I assume that these parts of the name correspond to the level of
the type and size variables.  If so, then you simply need to use inst
as the grouping factor for the random effects.

The only time that nesting must be explicitly stated is when the
levels of the variable(s) at the inner level(s) are incomplete.  I
call this "implicit nesting".  Suppose I choose 20 different plants
from an experimental plot and extract several seeds from each plant
then perform multiple analyses on each seed, I could label the plants
"A", "B", "C", ...., "T" and the seeds "a", "b","c", ... for each
plant.  This is implicit nesting in that I only have, say, 12 seed
labels but there may be one or two hundred seeds.  To specify a
particular seed I must not only specify its seed label but also the
plant from which it came.  If I just specify "seed" in a model formula
I will get an inappropriate model fit.  I need to somehow specify seed
within plant.

In my view this is not a characteristic of the experiment - it's just
a dumb way of labeling the seeds.  If you use labels like "Aa", "Ab",
..., as I suspect you have done for your "inst" factor, then the
problem goes away.

> I would like to write a formula for lmer of the kind:
>
> treat1 + treat2 + treat3 + group1 + group2 +
> group3 + Subject(group1*group2*group3)
>
> eventually plus some interaction terms.
>
> I am trying the following:
>
> mod <-
> lmer(err~k+initial+neighborhood+size+dens+type+(1|inst),data=Case3)

I think that specifcation corresponds to the model that you describe
above, although I am not quite sure what the distinction between a
treatment factor and a group factor is.

>  > summary(mod)
> Linear mixed-effects model fit by REML
> Formula: err ~ k + initial + neighborhood + size +
> dens + type + (1 |      inst)
>     Data: Case3
>     AIC   BIC logLik MLdeviance REMLdeviance
>   16472 16542  -8224      16433        16448
> Random effects:
>   Groups   Name        Variance Std.Dev.
>   inst     (Intercept)  2.61    1.62
>   Residual             49.23    7.02
> number of obs: 2430, groups: inst, 90

It appears that the random effect for the inst factor may be
unnecessary.  You may want to check what the log-likelihood for a
model with only fixed-effects is.

>
>  > anova(mod)
> Analysis of Variance Table
>               Df Sum Sq Mean Sq
> k             2   5473    2736
> initial       2  72168   36084
> neighborhood  2  18622    9311
> size          1     52      52
> dens          2      2       1
> type          1  31378   31378

> But results are very different from what I get in SAS

I can't really comment on that without knowing how you specified the
model in SAS and what analysis of variance results from SAS you are
comparing.  This analysis of variance table, like most such tables in
R, is the decomposition of the variation in the response according to
the terms in the order they were given in the formula.  As Bill
Venables describes in his famous (and, regretably, unpublished) paper
"Exegeses on linear models" (just search for the title in a search
engine) this is the only decomposition that makes sense but that does
not deter many people, including the authors of SAS, from creating
other decompositions that may on the surface appear to make sense but
do not withstand careful scrutiny.

It appears that you may have a completely balanced experiment here
(I'm guessing that it is a computer experiment) in which case the
decomposition is invariant to reordering of the terms in the model.
In general, if type, size and dens are blocking factors or
environmental factors (that is, they represent a known source of
variability and you wish to control for these factors in examining
your experimental factors) then they should be entered first in the
formula.

> and I cannot figure out why the anova method
> does not return the test for significance.

Ah, that's a long story.  One can calculate F-ratios for fixed-effects
terms in a linear mixed model but they don't have an F distribution
except in certain balanced cases.  Determining a p-value for a
fixed-effects term in a mixed model fit to unbalanced data is not
trivial.  At one time I did list p-values in such a table but they
were approximations and rather coarse approximations that erred in the
wrong direction.  Some users, quite reasonably, objected that these
could be dangerously misleading so my current solution is not to
return a p-value at all.

> If I try this formula, the computation never stops:
>
> mod <- lmer(err~initial+neighborhood+k+(1|inst) +
> (inst|type)+(inst|size) + (inst|dens),data=Case3)

You really, really don't want to try that.  The expression on the left
hand side of a random-effects term is treated as a linear model
formula from which a model matrix is evaluated.  The model matrix for
inst has 90 columns so each level of type is being modeled as having
90, possibly correlated, random effects associated with it.  The same
for size and dens.  90 correlated random effects requires estimation
of 90 variance parameters and 4005 covariance parameters.  The general
rule is that a factor on the left hand side of the '|' should have a
very small number of levels whereas a factor on the right hand side
should have a large number of levels.

>
> Which should be the right formula?
> In case, this is the structure of my data:
>
>  > str(Case3)
> 'data.frame':   2430 obs. of  8 variables:
>   $ err         : num  12.60 11.03  9.99  2.48
> 2.48 ...
>   $ initial     : Factor w/ 3 levels
> "greedy_cov","lightest_add",..: 2 2 2 1 1 1 3 3 3
> 2 ...
>   $ neighborhood: Factor w/ 3 levels
> "k-add","k-cov",..: 1 1 1 1 1 1 1 1 1 2 ...
>   $ k           : Factor w/ 3 levels "1","3","5":
> 1 2 3 1 2 3 1 2 3 1 ...
>   $ type        : Factor w/ 2 levels "G","U": 1 1
> 1 1 1 1 1 1 1 1 ...
>   $ size        : num  200 200 200 200 200 200 200
> 200 200 200 ...
>   $ dens        : Factor w/ 3 levels "L","M","H":
> 1 1 1 1 1 1 1 1 1 1 ...
>   $ inst        : Factor w/ 90 levels
> "G-200-0.1-1",..: 1 1 1 1 1 1 1 1 1 1 ...
>
>
>
> Thank you in advance for any suggestion you may
> provide.
>
> Best regards,
>
> Marco
>
>
>
>
> --
> Marco Chiarandini
> http://www.imada.sdu.dk/~marco
> Department of Mathematics             Email:
> marco at imada.sdu.dk
> and Computer Science,                 Phone: +45 6550 4031
> University of Southern Denmark        Fax: +45
> 6593 2691
>
>
>