[R] Anova and unbalanced designs

Mon Feb 16 20:46:39 CET 2009

Tal,

The reparametirized cell means model can help with the understanding of what the individual terms mean in an analysis with contrasts.  The cell means model is y=W*mu + e, where mu is the vector of the cell means (the mean for each group) and W is just a stretched identity matrix, this model just fits the means of each cell without any comparisons/contrasts.  The reparameterized cell means model is y=W*Ai * A*mu + e, where Ai and A are inverses of each other and determine the set of contrasts (I tend to think of A as the contrast matrix and Ai as the dummy variable encoding matrix, but in some cases Ai is called the contrast matrix).  Basically this leads to x=W*Ai and beta= A*mu for the standard regression model of y=x*beta+e, so R is creating x for you and we just need to find A (the inverse of Ai) to see what beta really means.

We can start by creating some labels (assuming 5 groups for example) and loading the MASS package for some prettiness later:

library(MASS)

betas <- paste('b',0:4, sep='' )
mus <- paste('mu',1:5, sep='' )

Now, let's look at what Helmert contrasts give us:

Ai1 <- cbind(1, contr.helmert(5))
A1 <- solve(Ai1)

A1txt <- as.character(fractions(A1))

paste( betas, '=', apply(A1txt, 1, paste, mus, sep='*', collapse=' + '))

So beta0 (the intercept) is just the mean of the 5 groups, beta1 compares the first group to the second (actually half the first to half the second, this matters for interpreting the beta or confidence intervals, but not hypothesis tests).
Then beta2 compares the average of the first 2 groups to the 3rd group (with an extra 1/3rd in there, this makes the original Ai matrix and x matrix prettier).  Beta3 and beta4 compare groups 4 and 5 to the mean of the previous ones respectively.

Now look at summation contrasts (the contrasts sum to 0)

Ai2 <- cbind(1, contr.sum(5))
A2 <- solve(Ai2)
dimnames(A2) <- list(betas,mus)

fractions(A2)

The beta0 coefficient is still the overall mean and with a little algebra it can be seen that the other rows/betas measure the difference between the cell means (except the last) and the overall mean (just replace 4/5 with 1-1/5).

Now for the non-orthogonal treatment contrasts:

Ai3 <- cbind(1, contr.treatment(5))
A3 <- solve(Ai3)
dimnames(A3) <- list(betas, mus)
fractions(A3)

Now beta0 is not a mean of all the groups, but the mean of the first (reference) group.  The other betas are then the differences between the other groups and the reference group.

Polynomial contrasts are a bit more difficult to interpret:

Ai4 <- cbind(1, contr.poly(5))
A4 <- solve(Ai4)
dimnames(A4) <- list(betas, mus)
zapsmall(A4)
matplot(1:5, t(A4), type='b')

The graph is probably the easiest to interpret, the intercept is still the overall mean, beta1 shows a linear relationship, beta2 follows a quadratic, etc.  These are only meaningful if the groups are ordered and the same distance apart.

We can use the same idea in reverse to create our own contrasts, suppose we want to compare group 1 (control) to the mean of the rest, then compare groups 2 and 3, compare groups 4 and 5, then compare the mean of groups 2 and 3 to the mean of groups 4 and 5, we can do either of the following (depending on what we want beta0 to mean):

A6 <- rbind(1/5,
                c(-4,  1,  1,  1,  1)/4,
                c( 0, -1,  1,  0,  0),
                c( 0,  0,  0, -1,  1),
                c( 0, -1, -1,  1,  1)/2 )
fractions(A6)
zapsmall(solve(A6))

A7 <- rbind(c(1, 0,  0,  0,  0),
                c(-4,  1,  1,  1,  1)/4,
                c( 0, -1,  1,  0,  0),
                c( 0,  0,  0, -1,  1),
                c( 0, -1, -1,  1,  1)/2 )
fractions(A7)
zapsmall(solve(A7))

Just use the above Ai matricies to create x (use the C (note capital) or contrasts functions) and the individual terms will have the desired interpretations.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Tal Galili
> Sent: Sunday, February 15, 2009 3:16 AM
> To: John Fox
> Cc: r-help at r-project.org; Michael Friendly; Nils Skotara; Peter
> Dalgaard
> Subject: Re: [R] Anova and unbalanced designs
>
> Dear John - thank you for your detailed answer and help.
> Your answer encourages me to ask further: by choosing different
> contrasts,
> what are the different hypothesis which are being tested? (or put
> differently - should I prefer contr.sum over  contr.poly or
> contr.helmert,
> or does this makes no difference ?)
> How should this question be approached/answered ?
>
> I see in the ?contrasts in R that the referenced reading is:
> "Chambers, J. M. and Hastie, T. J. (1992) *Statistical models.* Chapter
> 2
> of *Statistical Models in S* eds J. M. Chambers and T. J. Hastie,
> Wadsworth
> & Brooks/Cole."
> Yet I must admit I don't have this book readily available (not on the
> web,
> nor in my local library), so other recommended sources would be of
> great
> help.
>
>
> For future reference I add here a some tinkering of the code to show
> how
> implementing different contrasts will resort in different SS type III
> analysis results:
>
>  phase <- factor(rep(c("pretest", "posttest", "followup"), c(5, 5, 5)),
>      levels=c("pretest", "posttest", "followup"))
>  hour <- ordered(rep(1:5, 3))
>  idata <- data.frame(phase, hour)
>
>  contrasted.treatment <- C(OBrienKaiser$treatment, "contr.treatment")
>  mod.ok.contr.treatment <- lm(cbind(pre.1, pre.2, pre.3, pre.4, pre.5,
>                      post.1, post.2, post.3, post.4, post.5,
>                       fup.1, fup.2, fup.3, fup.4, fup.5) ~
> contrasted.treatment*gender,     data=OBrienKaiser)
>  contrasted.treatment <- C(OBrienKaiser$treatment, "contr.helmert")
>  mod.ok.contr.helmert <- lm(cbind(pre.1, pre.2, pre.3, pre.4, pre.5,
>                      post.1, post.2, post.3, post.4, post.5,
>                       fup.1, fup.2, fup.3, fup.4, fup.5) ~
> contrasted.treatment*gender,     data=OBrienKaiser)
>  contrasted.treatment <- C(OBrienKaiser$treatment, "contr.poly")
>  mod.ok.contr.poly <- lm(cbind(pre.1, pre.2, pre.3, pre.4, pre.5,
>                      post.1, post.2, post.3, post.4, post.5,
>                       fup.1, fup.2, fup.3, fup.4, fup.5) ~
> contrasted.treatment*gender,     data=OBrienKaiser)
>  contrasted.treatment <- C(OBrienKaiser$treatment, "contr.sum")
>  mod.ok.contr.sum <- lm(cbind(pre.1, pre.2, pre.3, pre.4, pre.5,
>                      post.1, post.2, post.3, post.4, post.5,
>                       fup.1, fup.2, fup.3, fup.4, fup.5) ~
> contrasted.treatment*gender,     data=OBrienKaiser)
>
> # this is one result:
> (Anova(mod.ok.contr.treatment, idata=idata, idesign=~phase*hour, type =
> "III"))
> # all of the other contrasts will now give the same outcome:  (does
> that
> mean there shouldn't be a preference of using one over the other ?)
> (Anova(mod.ok.contr.helmert, idata=idata, idesign=~phase*hour, type =
> "III"))
> (Anova(mod.ok.contr.poly, idata=idata, idesign=~phase*hour, type =
> "III"))
> (Anova(mod.ok.contr.sum, idata=idata, idesign=~phase*hour, type =
> "III"))
>
>
>
>
> With regards,
> Tal
>
>
>
>
> On Sat, Feb 14, 2009 at 7:09 PM, John Fox <jfox at mcmaster.ca> wrote:
>
> > Dear Tal,
> >
> > > -----Original Message-----
> > > From: Tal Galili [mailto:tal.galili at gmail.com]
> > > Sent: February-14-09 10:23 AM
> > > To: John Fox
> > > Cc: Peter Dalgaard; Nils Skotara; r-help at r-project.org; Michael
> Friendly
> > > Subject: Re: [R] Anova and unbalanced designs
> > >
> > > Hello John and other R mailing list members.
> > >
> > > I've been following your discussions regarding the Anova command
> for the
> > SS
> > > type 2/3 repeated measures Anova, and I have a question:
> > >
> > > I found that when I go from using type II to using type III, the
> summary
> > > model is suddenly added with an "intercept" term (example in the
> end of
> > the
> > > e-mail). So my question is
> > > 1) why is this "intercept" term added (in SS type "III" vs the type
> > "II")?
> >
> > The computational approach taken in Anova() makes it simpler to
> include the
> > intercept in the "type-III" tests and not to include it in the "type-
> II"
> > tests.
> >
> > > 2) Can/should this "intercept" term be removed ? (or how should it
> be
> > > interpreted ?)
> >
> > The test for the intercept is rarely of interest. A "type-II" test
> for the
> > intercept would test that the unconditional mean of the response is
> 0; a
> > "type-III" test for the intercept would test that the constant term
> in the
> > full model fit to the data is 0. The latter depends upon the
> > parametrization
> > of the model (in the case of an ANOVA model, what kind of "contrasts"
> are
> > used). You state that the example that you give is taken from ?Anova
> but
> > there's a crucial detail that's omitted: The help file only gives the
> > "type-II" tests; the "type-III" tests are also reasonable here, but
> they
> > depend upon having used "contr.sum" (or another set of contrasts
> that's
> > orthogonal in the row basis of the model matrix) for the between-
> subject
> > factors, treatment and gender. This detail is in the data set:
> >
> > > OBrienKaiser$gender
> >  [1] M M M F F M M F F M M M F F F F
> > attr(,"contrasts")
> > [1] contr.sum
> > Levels: F M
> >
> > > OBrienKaiser$treatment
> >  [1] control control control control control A       A       A
> A
> > B       B
> > [12] B       B       B       B       B
> > attr(,"contrasts")
> >        [,1] [,2]
> > control   -2    0
> > A          1   -1
> > B          1    1
> > Levels: control A B
> >
> > With proper contrast coding, the "type-III" test for the intercept
> tests
> > that the mean of the cell means (the "grand mean") is 0.
> >
> > Had the default dummy-coded contrasts (from contr.treatment) been
> used, the
> > tests would not have tested reasonable hypotheses. My advice, from
> the help
> > file: "Be very careful in formulating the model for type-III tests,
> or the
> > hypotheses tested will not make sense."
> >
> > I hope this helps,
> >  John
> >
> > >
> > > My purpose is to be able to use the Anova for analyzing an
> experiment
> > with
> > a
> > > 2 between and 3 within factors, where the between factors are not
> > balanced,
> > > and the within factors are (that is why I can't use the aov
> command).
> > >
> > >
> > > #---code start
> > >
> > > #---code start
> > >
> > > #---code start
> > >
> > > # (taken from the ?Anova help file)
> > >
> > > phase <- factor(rep(c("pretest", "posttest", "followup"), c(5, 5,
> 5)),
> > >     levels=c("pretest", "posttest", "followup"))
> > > hour <- ordered(rep(1:5, 3))
> > > idata <- data.frame(phase, hour)
> > > idata
> > > mod.ok <- lm(cbind(pre.1, pre.2, pre.3, pre.4, pre.5,
> > >                      post.1, post.2, post.3, post.4, post.5,
> > >                      fup.1, fup.2, fup.3, fup.4, fup.5) ~
> > treatment*gender,
> > >                 data=OBrienKaiser)
> > >
> > > # now we have two options
> > > # option one is to use type II:
> > >
> > > (av.ok <- Anova(mod.ok, idata=idata, idesign=~phase*hour, type =
> "II"))
> > >
> > >
> > > #output:
> > > Type II Repeated Measures MANOVA Tests: Pillai test statistic
> > >                             Df test stat approx F num Df den Df
> Pr(>F)
> > > treatment                    2    0.4809   4.6323      2     10
> 0.0376868
> > *
> > > gender                       1    0.2036   2.5558      1     10
> 0.1409735
> > > treatment:gender             2    0.3635   2.8555      2     10
> 0.1044692
> > > phase                        1    0.8505  25.6053      2      9
> 0.0001930
> > ***
> > > treatment:phase              2    0.6852   2.6056      4     20
> 0.0667354
> > .
> > > gender:phase                 1    0.0431   0.2029      2      9
> 0.8199968
> > > treatment:gender:phase       2    0.3106   0.9193      4     20
> 0.4721498
> > > hour                         1    0.9347  25.0401      4      7
> 0.0003043
> > ***
> > > treatment:hour               2    0.3014   0.3549      8     16
> 0.9295212
> > > gender:hour                  1    0.2927   0.7243      4      7
> 0.6023742
> > > treatment:gender:hour        2    0.5702   0.7976      8     16
> 0.6131884
> > > phase:hour                   1    0.5496   0.4576      8      3
> 0.8324517
> > > treatment:phase:hour         2    0.6637   0.2483     16      8
> 0.9914415
> > > gender:phase:hour            1    0.6950   0.8547      8      3
> 0.6202076
> > > treatment:gender:phase:hour  2    0.7928   0.3283     16      8
> 0.9723693
> > > ---
> > > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> > >
> > >
> > > # option two is to use type III, and then get an added intercept
> term:
> > >  (av.ok <- Anova(mod.ok, idata=idata, idesign=~phase*hour, type =
> "III"))
> > >
> > >
> > > # here is the output:
> > > Type III Repeated Measures MANOVA Tests: Pillai test statistic
> > >                             Df test stat approx F num Df den Df
> Pr(>F)
> > > (Intercept)                  1     0.967  296.389      1     10
> 9.241e-09
> > ***
> > > treatment                    2     0.441    3.940      2     10
> 0.0547069
> > .
> > > gender                       1     0.268    3.659      1     10
> 0.0848003
> > .
> > > treatment:gender             2     0.364    2.855      2     10
> 0.1044692
> > > phase                        1     0.814   19.645      2      9
> 0.0005208
> > ***
> > > treatment:phase              2     0.696    2.670      4     20
> 0.0621085
> > .
> > > gender:phase                 1     0.066    0.319      2      9
> 0.7349696
> > > treatment:gender:phase       2     0.311    0.919      4     20
> 0.4721498
> > > hour                         1     0.933   24.315      4      7
> 0.0003345
> > ***
> > > treatment:hour               2     0.316    0.376      8     16
> 0.9183275
> > > gender:hour                  1     0.339    0.898      4      7
> 0.5129764
> > > treatment:gender:hour        2     0.570    0.798      8     16
> 0.6131884
> > > phase:hour                   1     0.560    0.478      8      3
> 0.8202673
> > > treatment:phase:hour         2     0.662    0.248     16      8
> 0.9915531
> > > gender:phase:hour            1     0.712    0.925      8      3
> 0.5894907
> > > treatment:gender:phase:hour  2     0.793    0.328     16      8
> 0.9723693
> > > ---
> > > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> > >
> > >
> > >
> > > #---code end
> > >
> > > #---code end
> > >
> > > #---code end
> > >
> > >
> > >
> > > Thanks in advance for your help!
> > > Tal Galili
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Sun, Jan 25, 2009 at 3:08 AM, John Fox <jfox at mcmaster.ca> wrote:
> > >
> > >
> > >       Dear Peter and Nils,
> > >
> > >       In my initial message, I stated misleadingly that the
> contrast
> > coding
> > > didn't
> > >       matter for the "type-III" tests here since there is just one
> > >       between-subjects factor, but that's not right: The between
> type-III
> > SS
> > > is
> > >       correct using contr.treatment(), but the within SS is not. As
> is
> > > generally
> > >       the case, to get reasonable type-III tests (i.e., tests of
> > reasonable
> > >       hypotheses), it's necessary to have contrasts that are
> orthogonal
> > in
> > > the
> > >       row-basis of the design, such as contr.sum(),
> contr.helmert(), or
> > >       contr.poly(). The "type-II" tests, however, are insensitive
> to the
> > > contrast
> > >       parametrization. Anova() always uses an orthogonal
> parametrization
> > for
> > > the
> > >       within-subjects design.
> > >
> > >       The general advice in ?Anova is, "Be very careful in
> formulating
> > the
> > > model
> > >       for type-III tests, or the hypotheses tested will not make
> sense."
> > >
> > >       Thanks, Peter, for pointing this out.
> > >
> > >
> > >       John
> > >
> > >       ------------------------------
> > >       John Fox, Professor
> > >       Department of Sociology
> > >       McMaster University
> > >       Hamilton, Ontario, Canada
> > >       web: socserv.mcmaster.ca/jfox
> > >
> > >
> > >       > -----Original Message-----
> > >
> > >       > From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk]
> > >       > Sent: January-24-09 6:31 PM
> > >       > To: Nils Skotara
> > >       > Cc: John Fox; r-help at r-project.org; 'Michael Friendly'
> > >       > Subject: Re: [R] Anova and unbalanced designs
> > >       >
> > >
> > >       > Nils Skotara wrote:
> > >       > > Dear John,
> > >       > >
> > >       > > thank you again! You replicated the type III result I got
> in
> > SPSS!
> > > When
> > >       I
> > >       > > calculate Anova() type II:
> > >       > >
> > >       > > Univariate Type II Repeated-Measures ANOVA Assuming
> Sphericity
> > >       > >
> > >       > >                     SS num Df Error SS den Df      F
> Pr(>F)
> > >       > > between         4.8000      1   9.0000      8 4.2667
> 0.07273 .
> > >       > > within          0.2000      1  10.6667      8 0.1500
> 0.70864
> > >       > > between:within  2.1333      1  10.6667      8 1.6000
> 0.24150
> > >       > > ---
> > >       > > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1
> ' ' 1
> > >       > >
> > >       > > I see the exact same values as you had written.
> > >       > > However, and now I am really lost, type III (I did not
> change
> > > anything
> > >       > else)
> > >       > > leads to the following:
> > >       > >
> > >       > > Univariate Type III Repeated-Measures ANOVA Assuming
> Sphericity
> > >       > >
> > >       > >                               SS num Df Error SS den Df
> F
> > >       Pr(>F)
> > >       > > (Intercept)               72.000      1    9.000      8
> 64.0000
> > >       4.367e-05
> > >       > ***
> > >       > > between                    4.800      1    9.000      8
> 4.2667
> > >       0.07273 .
> > >       > > as.factor(within)          2.000      1   10.667      8
> 1.5000
> > >       0.25551
> > >       > > between:as.factor(within)  2.133      1   10.667      8
> 1.6000
> > >       0.24150
> > >       > > ---
> > >       > > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1
> ' ' 1
> > >       > >
> > >       > > How is this possible?
> > >       >
> > >       > This looks like a contrast parametrization issue: If we
> look at
> > the
> > >       > per-group mean within-differences and their SE, we get
> > >       >
> > >       >  > summary(lm(within1-within2~between - 1))
> > >       > ..
> > >       > Coefficients:
> > >       >           Estimate Std. Error t value Pr(>|t|)
> > >       > between1  -1.0000     0.8165  -1.225    0.256
> > >       > between2   0.3333     0.6667   0.500    0.631
> > >       > ..
> > >       >  > table(between)
> > >       > between
> > >       > 1 2
> > >       > 4 6
> > >       >
> > >       > Now, the type II F test is based on weighting the two means
> as
> > you
> > > would
> > >       > after testing for no interaction
> > >       >
> > >       >  > (4*-1+6*.3333)^2/(4^2*0.8165^2+6^2*0.6667^2)
> > >       > [1] 0.1500205
> > >       >
> > >       > and type III is to weight them as if there had been equal
> counts
> > >       >
> > >       >  > (5*-1+5*.3333)^2/(5^2*0.8165^2+5^2*0.6667^2)
> > >       > [1] 0.400022
> > >       >
> > >       > However, the result above corresponds to looking at group1
> only
> > >       >
> > >       >  > (-1)^2/(0.8165^2)
> > >       > [1] 1.499987
> > >       >
> > >       > It helps if you choose orhtogonal contrast
> parametrizations:
> > >       >
> > >       >  > options(contrasts=c("contr.sum","contr.helmert"))
> > >       >  > betweenanova <- lm(values ~ between)>
> Anova(betweenanova,
> > > idata=with,
> > >       > idesign= ~as.factor(within), type = "III" )
> > >       >
> > >       > Type III Repeated Measures MANOVA Tests: Pillai test
> statistic
> > >       >                            Df test stat approx F num Df den
> Df
> > > Pr(>F)
> > >       > (Intercept)                1     0.963  209.067      1
> 8
> > 5.121e-
> > > 07
> > >       ***
> > >       > between                    1     0.348    4.267      1
> 8
> > > 0.07273 .
> > >       > as.factor(within)          1     0.048    0.400      1
> 8
> > > 0.54474
> > >       > between:as.factor(within)  1     0.167    1.600      1
> 8
> > > 0.24150
> > >       > ---
> > >       > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '
> ' 1
> > >       >
> > >       >
> > >       >
> > >       >
> > >       > --
> > >       >     O__  ---- Peter Dalgaard             Øster Farimagsgade
> 5,
> > Entr.B
> > >       >    c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014
> Cph. K
> > >       >   (*) \(*) -- University of Copenhagen   Denmark      Ph:
> (+45)
> > > 35327918
> > >       > ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX:
> (+45)
> > > 35327907
> > >
> > >       ______________________________________________
> > >       R-help at r-project.org mailing list
> > >       https://stat.ethz.ch/mailman/listinfo/r-help
> > >       PLEASE do read the posting guide http://www.R-
> project.org/posting-
> > > guide.html
> > >       and provide commented, minimal, self-contained, reproducible
> code.
> > >
> > >
> > >
> > >
> > >
> > > --
> > > ----------------------------------------------
> > >
> > >
> > > My contact information:
> > > Tal Galili
> > > Phone number: 972-50-3373767
> > > FaceBook: Tal Galili
> > > My Blogs:
> > > www.talgalili.com
> > > www.biostatistics.co.il
> > >
> > >
> >
> >
> >
> >
>
>
> --
> ----------------------------------------------
>
>
> My contact information:
> Tal Galili
> Phone number: 972-50-3373767
> FaceBook: Tal Galili
> My Blogs:
> www.talgalili.com
> www.biostatistics.co.il
>
>       [[alternative HTML version deleted]]