[R-sig-ME] Data sheet notation and model structure for GLMM with 3 non-factorial factors

Thu Sep 24 14:10:44 CEST 2009

On Thu, Sep 24, 2009 at 1:22 AM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
> Hi R users,
>
> I have 3 factors in a non-factorial design (G, K and N), as well as
> two time periods (Year) and a random factor (Site), with Plant numbers
> as the response variable.
>
> My 1st question relates to the notation of the treatments in the
> data frame. Is it appropriate to use an expanded treatment notation,
> such as this, when using glmer{lme4}:
>
> Site    Year    Plant   G       K       N
> A       1       5       0       0       0
> A       1       4       1       0       0
> A       1       7       0       1       0
> A       1       10      0       0       1
> A       2       3       0       0       0
> A       2       4       1       0       0
> A       2       8       0       1       0
> A       2       12      0       0       1
> B       1       7       0       0       0
> B       1       3       1       0       0
> B       1       7       0       1       0
> B       1       12      0       0       1
> B       2       4       0       0       0
> B       2       5       1       0       0
> B       2       6       0       1       0
> B       2       11      0       0       1
>
> With the model
>
> m1<-glmer(Plant~G+K+N+Year+(1|Site), ...)
>
> Or is it better to use a single column for the treatments, like this:
>
> Site    Year    Plant   Treatment
> A       1       5       C
> A       1       4       G
> A       1       7       K
> A       1       10      N
> A       2       3       C
> A       2       4       G
> A       2       8       K
> A       2       12      N
> B       1       7       C
> B       1       3       G
> B       1       7       K
> B       1       12      N
> B       2       4       C
> B       2       5       G
> B       2       6       K
> B       2       11      N
>
> With the following model:
> m1<-glmer(Plant~Treatment+Year+(1|Site), ...)

The latter is preferred.  R will generate the indicator columns for
the levels of the Treatment factor (the 0/1 columns shown in the first
form, plus one for the control level C) and, when appropriate, reduce
them to a set of 3 "contrasts" in the model: one fewer than the 4
levels, because the intercept already accounts for one of them.  (The
reason for quoting the word "contrasts" is that there is a formal
mathematical definition of a contrast but the linear combinations
generated by R do not always satisfy this definition.  The method and
results are correct; it is just the name that is inaccurate.)
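
To make this concrete, here is a small sketch in base R that rebuilds
the single-column data sheet from the question and inspects the
fixed-effects model matrix.  Treating Year as a factor is an assumption
of this sketch (the post does not say how it is stored), and the names
dat, X and X_full are only illustrative.

```r
# Rebuild the single-column data sheet (values copied from the question).
dat <- data.frame(
  Site      = factor(rep(c("A", "B"), each = 8)),
  Year      = factor(rep(rep(1:2, each = 4), times = 2)),  # assumption: factor
  Plant     = c(5, 4, 7, 10, 3, 4, 8, 12, 7, 3, 7, 12, 4, 5, 6, 11),
  Treatment = factor(rep(c("C", "G", "K", "N"), times = 4))
)

# Fixed-effects part of the model: a 4-level factor yields 3 columns
# alongside the intercept under the default treatment coding.
X <- model.matrix(~ Treatment + Year, data = dat)
print(colnames(X))

# Without an intercept, all 4 indicator columns are kept.
X_full <- model.matrix(~ 0 + Treatment, data = dat)
print(colnames(X_full))

# The generated column for G is exactly the hand-made 0/1 column G
# from the first form of the data sheet.
g_by_hand <- as.numeric(dat$Treatment == "G")
print(all.equal(unname(X[, "TreatmentG"]), g_by_hand))

# Default coding matrix: its columns sum to 1, not 0, so they are not
# contrasts in the formal sense.
print(contrasts(dat$Treatment))
print(colSums(contrasts(dat$Treatment)))
```

The level C is the baseline here simply because it sorts first among
the factor levels.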

There are two reasons for preferring the latter.  First, it is easier
to maintain the data in a consistent form: factors maintain
consistency and are easy to check in the output from str() or
summary(), whereas indicator columns have inter-column dependencies
that must be checked separately.  Second, there is the "when
appropriate" clause above.  Determining a useful parameterization of a
linear model incorporating factors is subtle, and a lot of code in the
R function model.matrix is devoted to a symbolic analysis designed to
get this right.  Also, you can, if you wish, change the
parameterization (see ?contrasts).
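
As a rough illustration of both points, the following base R sketch
checks a factor and a set of indicator columns for consistency and then
changes the parameterization.  The objects trt, wide, trt_sum and
trt_ref are hypothetical names introduced only for this example.

```r
# Treatment as a single factor, in the order used in the question.
trt <- factor(rep(c("C", "G", "K", "N"), times = 4))

# Checking a factor: one glance at the levels and counts is enough.
str(trt)
print(summary(trt))

# Checking indicator columns: the dependency between columns (at most
# one of G, K, N may be 1 in any row) has to be verified explicitly.
wide <- data.frame(G = as.numeric(trt == "G"),
                   K = as.numeric(trt == "K"),
                   N = as.numeric(trt == "N"))
print(all(rowSums(wide) <= 1))

# Changing the parameterization: sum-to-zero coding in place of the
# default treatment coding.
trt_sum <- trt
contrasts(trt_sum) <- contr.sum(4)
print(contrasts(trt_sum))
print(colSums(contrasts(trt_sum)))  # columns sum to 0

# Alternatively, keep treatment coding but choose another baseline.
trt_ref <- relevel(trt, ref = "N")
print(levels(trt_ref))
```

Either change affects only how the coefficients are expressed; the
fitted values of the model stay the same.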