[R] unbalanced one-way ANOVA

Douglas Bates bates at stat.wisc.edu
Fri Feb 29 15:38:14 CET 2008


On Fri, Feb 29, 2008 at 4:47 AM, Nauta, A.L. <A.L.Nauta at students.uu.nl> wrote:

> Thank you for your reply,
> is your answer (that the approach does not depend on balance in the data)
> only valid for one-way anova, or also for two-way or more-way anova?

Any kind.

You should be aware that for unbalanced data sets the sum of squares
attributed to a term depends on the order in which the terms occur in
the model.  That is, the sum of squares and the F-ratios and the
p-values for, say, factor A will be different if you fit a model

y ~ A + B

versus the model

y ~ B + A

to a data set where factors A and B are unbalanced.
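
A minimal sketch of this order dependence, assuming simulated data: the data frame `d`, the cell counts, the effect sizes and the seed are all invented for illustration and are not from the original question.

```r
# Simulated, deliberately unbalanced two-factor data (all values hypothetical).
# Cell counts: a1/b1 = 9, a1/b2 = 3, a2/b1 = 2, a2/b2 = 4.
set.seed(1)
d <- data.frame(
  A = factor(rep(c("a1", "a2"), times = c(12, 6))),
  B = factor(c(rep(c("b1", "b2"), times = c(9, 3)),
               rep(c("b1", "b2"), times = c(2, 4))))
)
d$y <- rnorm(nrow(d)) + (d$A == "a2") + 0.5 * (d$B == "b2")

# Same two terms, entered in the two possible orders.
tab_AB <- anova(lm(y ~ A + B, data = d))   # A entered first
tab_BA <- anova(lm(y ~ B + A, data = d))   # A entered after B

ss_A_first <- tab_AB["A", "Sum Sq"]   # sum of squares for A, ignoring B
ss_A_last  <- tab_BA["A", "Sum Sq"]   # sum of squares for A, adjusted for B

print(tab_AB)
print(tab_BA)
```

The residual line is the same in both tables, since both formulas describe the same fitted model; only the split of the explained sum of squares between A and B changes.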

This is because the sums of squares displayed by R's anova methods are
the sequential sums of squares.  Although other statistical software
may calculate other, more exotic, types of sums of squares, many of us
would argue that these are the only ones that make sense.

If in doubt about which sum of squares to use, the general rule is
that you should only pay attention to the F ratio and p-value for the
last term in the model, because only that term has been adjusted for
every other term in the model.
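
One way to check this rule, sketched here on simulated data (the data frame `d` and everything in it are hypothetical), is to compare the sequential table from anova() with drop1(), which tests each term after adjusting for the others. In a main-effects model the two should agree for the last term only.

```r
# Simulated, deliberately unbalanced two-factor data (all values hypothetical).
set.seed(1)
d <- data.frame(
  A = factor(rep(c("a1", "a2"), times = c(12, 6))),
  B = factor(c(rep(c("b1", "b2"), times = c(9, 3)),
               rep(c("b1", "b2"), times = c(2, 4))))
)
d$y <- rnorm(nrow(d)) + (d$A == "a2") + 0.5 * (d$B == "b2")

fit <- lm(y ~ A + B, data = d)

seq_tab  <- anova(fit)             # sequential: A first, then B given A
marg_tab <- drop1(fit, test = "F") # each term dropped from the full model

# B is the last term, so its sequential test is already adjusted for A.
seq_tab["B", c("Sum Sq", "F value")]
marg_tab["B", c("Sum of Sq", "F value")]

# A is not the last term, so its sequential line ignores B and differs
# from the adjusted line reported by drop1().
seq_tab["A", "Sum Sq"]
marg_tab["A", "Sum of Sq"]
```

Refitting with the terms in the other order, y ~ B + A, and reading the last line of anova() gives the adjusted test for A without needing drop1().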

>  ________________________________
>  From: dmbates at gmail.com on behalf of Douglas Bates
> Sent: Fri 29-2-2008 0:39
> To: Nauta, A.L.
> Cc: r-help at r-project.org
> Subject: Re: [R] unbalanced one-way ANOVA
>
> On Thu, Feb 28, 2008 at 7:52 AM, Nauta, A.L. <A.L.Nauta at students.uu.nl>
> wrote:
> > Hi,
>
> >  I have an unbalanced dataset on which I would like to perform a one-way
> >  anova test using R (aov). According to Wonnacott and Wonnacott (1990) p.
> >  333, one-way anova with unbalanced data is possible with a few modifications
> >  to the anova calculations. The modified calculations should take into
> >  account different sample sizes and a modified definition of the average. I
> >  was wondering if the aov function in R is suitable for one-way anova on
> >  unbalanced data.
>
> Yes.
>
> The analysis of variance is performed in R by fitting a linear model
> created from indicator variables for the levels of the factor.  The
> validity of this approach does not depend on balance in the data.
>
> The formulas given in an introductory textbook are almost never the
> way that results are computed in practice.  I think we would all be
> better off if they didn't even give these misleading formulas.
>


