[R-sig-ME] Data sheet notation and model structure for GLMM with 3 non-factorial factors
Raldo Kruger
raldo.kruger at gmail.com
Sat Sep 26 19:31:51 CEST 2009
Thanks for the help!
On Sat, Sep 26, 2009 at 2:43 PM, Douglas Bates <bates at stat.wisc.edu> wrote:
> Sat, Sep 26, 2009 at 3:11 AM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
>> Hi Douglas,
>
>> Many thanks for the input. I've run two analyses on the same dataset
>> using 1) indicator columns and the 2) a single 'factor / treatment'
>> column for the non-factorial design described in my previous e-mail,
>> and the results were identical (great!).
>
>> However, I did the same for a dataset with a factorial design (N, G,
>> N*G, i.e. there were plots with N, plots with G, and plots with both N
>> and G), and the results for the main effects are identical, but the
>> estimates for the interaction effects (N*G) are different between the
>> two analyses (see below). Could you help me make sense of that please
>> (i.e. which one is correct?) !
>
> Generally when you have the possibility of having N and G combined you
> would treat the design as a two-factor two-level factorial. That is,
> one factor for presence or absence of G and another factor for
> presence or absence of N. You could treat it as a single factor with
> four levels (neither, G only, N only and both N and G) but, as you
> have seen you need to translate between the representations.
>
> In the two-factor, two-level factorial design, let a be the estimate
> of the main effect for G, b be the estimate of the main effect for N,
> and c be the interaction estimate. In your example a = 0.14929, b =
> 0.03766 and c = -0.31633. Then the estimated cell mean for the NG
> cell is a + b + c =
>> 0.03766 + 0.14929 + (-0.31633)
> [1] -0.12938
>
>> Thanks,
>> Raldo
>>
>> With expanded treatment notation-
>> Fixed effects:
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) 2.92060 0.23834 12.254 < 2e-16 ***
>> N 0.03766 0.03486 1.080 0.2801
>> G 0.14929 0.03395 4.397 1.10e-05 ***
>> Yearthree -2.85449 0.10664 -26.768 < 2e-16 ***
>> Yeartwo -1.88175 0.06844 -27.494 < 2e-16 ***
>> N:G -0.31633 0.04953 -6.386 1.70e-10 ***
>> N:Yearthree 0.15710 0.14428 1.089 0.2762
>> N:Yeartwo 0.14736 0.09305 1.584 0.1133
>> G:Yearthree -0.25107 0.15430 -1.627 0.1037
>> G:Yeartwo 0.07550 0.09200 0.821 0.4118
>> N:G:Yearthree 0.36353 0.20810 1.747 0.0807 .
>> N:G:Yeartwo -0.01158 0.12996 -0.089 0.9290
>>
>> With single column treatment notation-
>> Fixed effects:
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) 2.92057 0.23836 12.253 < 2e-16 ***
>> TreatG 0.14928 0.03395 4.397 1.10e-05 ***
>> TreatN 0.03767 0.03486 1.080 0.279928
>> TreatNG -0.12938 0.03639 -3.556 0.000377 ***
>> Yearthree -2.85448 0.10664 -26.768 < 2e-16 ***
>> Yeartwo -1.88175 0.06844 -27.494 < 2e-16 ***
>> TreatG:Yearthree -0.25109 0.15430 -1.627 0.103693
>> TreatN:Yearthree 0.15711 0.14428 1.089 0.276199
>> TreatNG :Yearthree 0.26959 0.14636 1.842 0.065483 .
>> TreatG:Yeartwo 0.07549 0.09200 0.820 0.411941
>> TreatN:Yeartwo 0.14735 0.09305 1.583 0.113308
>> TreatNG :Yeartwo 0.21118 0.09558 2.210 0.027139 *
>>
>>
>> On Thu, Sep 24, 2009 at 2:10 PM, Douglas Bates <bates at stat.wisc.edu> wrote:
>>> On Thu, Sep 24, 2009 at 1:22 AM, Raldo Kruger <raldo.kruger at gmail.com> wrote:
>>>> Hi R users,
>>>>
>>>> I have 3 factors in a non-factorial design (G, K and N), as well as
>>>> two time periods (Year) and a random factor (Site), with Plant numbers
>>>> as the response variable.
>>>>
>>>> My 1st question relates to the the notation of the treatments in the
>>>> data frame. Is it appropriate to use an expanded treatment notation,
>>>> such as this, when using glmer{lme4}:
>>>>
>>>> Site Year Plant G K N
>>>> A 1 5 0 0 0
>>>> A 1 4 1 0 0
>>>> A 1 7 0 1 0
>>>> A 1 10 0 0 1
>>>> A 2 3 0 0 0
>>>> A 2 4 1 0 0
>>>> A 2 8 0 1 0
>>>> A 2 12 0 0 1
>>>> B 1 7 0 0 0
>>>> B 1 3 1 0 0
>>>> B 1 7 0 1 0
>>>> B 1 12 0 0 1
>>>> B 2 4 0 0 0
>>>> B 2 5 1 0 0
>>>> B 2 6 0 1 0
>>>> B 2 11 0 0 1
>>>>
>>>> With the model
>>>>
>>>> m1<-glmer(Plant~G+K+N+Year+(1|Site), ...)
>>>>
>>>> Or is it better to use a single column for the treatments, like this:
>>>>
>>>> Site Year Plant Treatment
>>>> A 1 5 C
>>>> A 1 4 G
>>>> A 1 7 K
>>>> A 1 10 N
>>>> A 2 3 C
>>>> A 2 4 G
>>>> A 2 8 K
>>>> A 2 12 N
>>>> B 1 7 C
>>>> B 1 3 G
>>>> B 1 7 K
>>>> B 1 12 N
>>>> B 2 4 C
>>>> B 2 5 G
>>>> B 2 6 K
>>>> B 2 11 N
>>>>
>>>> With the following model:
>>>> m1<-glmer(Plants~Treatment+Year+(1|Site), ...)
>>>
>>> The latter is preferred. R will generate the indicator columns for
>>> the levels of the Treatment factor (the 0/1 columns shown in the first
>>> form) and, when appropriate, reduce them to a set of 2 "contrasts" in
>>> the model. (The reason for quoting the word "contrasts" is that there
>>> is a formal mathematical definition of a contrast but the linear
>>> combinations generated by R do not always satisfy this definition.
>>> The method and results are correct, it is just the name that is
>>> inaccurate.)
>>>
>>> The reason that the latter is preferred is that it is easier to
>>> maintain the data in a consistent form (factors maintain consistency
>>> and are easy to check in the output from str() or summary(), whereas
>>> indicator columns have inter-column dependencies that must be checked
>>> separately) and the "when appropriate" clause above. Determining a
>>> useful parameterization of a linear model incorporating factors is
>>> subtle and a lot of code in the R function model.matrix is devoted to
>>> a symbolic analysis designed to get this right. Also, you can, if you
>>> wish, change the parameterization (see ?contrasts).
>>>
>>
>>
>>
>> --
>> Raldo
>>
>
--
Raldo
More information about the R-sig-mixed-models
mailing list