[R] Testing significance in a design with unequal but proportional sample sizes

Fri Mar 5 00:08:09 CET 2004

Hello,

This is a follow-up on the question about the analysis of unbalanced
data, based on my (limited) understanding of what goes on in such cases.

When the data is unbalanced in a factorial design,
the main effect of a given factor can be defined in several ways.
Which type of main effect is relevant depends on the scientific question.

Some textbooks distinguish between weighted and unweighted mean effects.

If you use the 'aov' function with an unbalanced design, it will report
(for the first factor in the formula) the F ratio associated with the
"weighted means" solution. That is, the computation of the main effect
ignores the imbalance: the effect of a factor 'a' is computed regardless
of how the units are distributed among the levels of the other factors.

Consider:

 > x<-scan()
1: 1 2 3
4: 4 5 6 7 8
9: 1 2 3 4 5
14: 6 7 8
17:
Read 16 items
 > a<-factor(rep(c(1,2),c(8,8)))
 > b<-factor(rep(c(1,2,1,2),c(3,5,5,3)))
 >
 > tapply(x,list(a=a,b=b),mean)
   b
a   1 2
  1 2 6
  2 3 7
 > tapply(x,a,mean)
  1   2
4.5 4.5

If all units are given the same weight (that is, if we ignore the factor 'b'),
then the main effect of 'a' is 0.
This is confirmed by:

 > summary(aov(x~a*b))
            Df    Sum Sq   Mean Sq   F value    Pr(>F)
a            1 2.417e-32 2.417e-32 1.209e-32 1.0000000
b            1        60        60        30 0.0001413 ***
a:b          1 5.621e-31 5.621e-31 2.810e-31 1.0000000
Residuals   12        24         2

This is called the weighted means approach because the subgroups defined
by the crossing of 'a' and 'b' are given weights proportional to their size.
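
To make the weighting explicit, here is a minimal sketch (re-entering the
same data so that it runs on its own). It recovers the marginal means of 'a'
as size-weighted averages of the cell means, and it also refits the model
with 'b' first, which should show that the sum of squares reported for 'a'
depends on its position in the formula. The names cell_means, cell_n and
weighted_a are mine; anova(lm(...)) is used in place of summary(aov(...))
only because its row names are easier to index.

```r
# Same data as in the transcript above
x <- c(1, 2, 3,  4, 5, 6, 7, 8,  1, 2, 3, 4, 5,  6, 7, 8)
a <- factor(rep(c(1, 2), c(8, 8)))
b <- factor(rep(c(1, 2, 1, 2), c(3, 5, 5, 3)))

# Cell means and cell sizes, both as 2 x 2 tables indexed by (a, b)
cell_means <- tapply(x, list(a = a, b = b), mean)
cell_n     <- tapply(x, list(a = a, b = b), length)

# Marginal means of 'a' as averages of cell means weighted by cell size;
# this reproduces tapply(x, a, mean)
weighted_a <- rowSums(cell_means * cell_n) / rowSums(cell_n)
weighted_a

# Sequential sums of squares for 'a' when it is entered first vs. second
ss_a_first  <- anova(lm(x ~ a * b))["a", "Sum Sq"]
ss_a_second <- anova(lm(x ~ b * a))["a", "Sum Sq"]
c(first = ss_a_first, second = ss_a_second)
```

With these data, 'a' entered first has a sum of squares of (numerically)
zero, while 'a' entered after 'b' does not.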

Now, another approach is to forget about the individual units
and just consider the table of means:

 > tapply(x,list(a=a,b=b),mean)
   b
a   1 2
  1 2 6
  2 3 7

Forgetting about the sample sizes, one way to define the main effect
of 'a' is as the mean of 2 and 6 versus the mean of 3 and 7:

 > t=tapply(x,list(a=a,b=b),mean)
 > diff(apply(t,1,mean))
2
1

That is, the effect is 1.

One can compute a "fake" Mean Square associated with 'a' as
(n-1)*effect-size = 15*1 = 15,
and compare it to the MSE from the previous ANOVA (2 with 12 d.f.).

The F ratio = 15/2 = 7.5 reaches significance. (Note that 'pf' returns the
lower tail, so the p-value is 1 - 0.982, about 0.018.)
 > pf(7.5,1,12)
[1] 0.9820225
 >

If I am correct, this is what textbooks call the "unweighted means" approach.
In many cases, it is this type of main effect which is relevant
(especially when the imbalance is due to randomly missing observations).

I do not know if there is a solution in R
for easily computing the unweighted main effects and assessing
their significance. (Anyone?)
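
One possible answer, offered as a sketch rather than a definitive recipe:
with sum-to-zero contrasts, dropping each term in turn from the full model
(often called "Type III" tests) tests the main effect of 'a' averaged over
the cells with equal weights, i.e. the unweighted main effect. The code
below re-enters the data and uses only base R ('lm' and 'drop1'); 'fit' and
'tab' are names I made up. For these data it gives F = 1.875 for 'a', not
the 7.5 from the rough "fake Mean Square" calculation above, so that
shortcut may overstate the effect.

```r
# Same data as in the transcript above
x <- c(1, 2, 3,  4, 5, 6, 7, 8,  1, 2, 3, 4, 5,  6, 7, 8)
a <- factor(rep(c(1, 2), c(8, 8)))
b <- factor(rep(c(1, 2, 1, 2), c(3, 5, 5, 3)))

# Full factorial model with sum-to-zero coding for both factors, so that
# each main-effect coefficient compares equally weighted averages of cell means
fit <- lm(x ~ a * b, contrasts = list(a = "contr.sum", b = "contr.sum"))

# Drop every term in turn (including main effects) and test with an F ratio
tab <- drop1(fit, scope = . ~ ., test = "F")
tab

# F ratio for the unweighted main effect of 'a'
tab["a", "F value"]
```

The default treatment contrasts would not give the same tests for the main
effects, which is why the contrasts are set explicitly here.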

Actually, the different types of main effects defined above just
correspond to different contrasts on the cell means. So if there is an
easy solution to compute arbitrary contrasts on the cell means in a
factorial design, this could be an approach to this question. (Anyone?)
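
A minimal sketch of that idea in base R, assuming the residual variance is
common to all cells: fit a cell-means model (one coefficient per a-by-b
cell, no intercept), then test a linear combination of the coefficients
with an F statistic on 1 and the residual degrees of freedom. The names
cell, a_of_cell, w, est, var_est and f_stat are illustrative only; the
weights in w encode the unweighted main effect of 'a' and can be replaced
by any other contrast.

```r
# Same data as in the transcript above
x <- c(1, 2, 3,  4, 5, 6, 7, 8,  1, 2, 3, 4, 5,  6, 7, 8)
a <- factor(rep(c(1, 2), c(8, 8)))
b <- factor(rep(c(1, 2, 1, 2), c(3, 5, 5, 3)))

# One factor level per cell; labels look like "<a level>.<b level>"
cell <- interaction(a, b)
fit  <- lm(x ~ 0 + cell)          # coefficients are the four cell means

# Level of 'a' for each cell, read from the label before the dot
a_of_cell <- sub("\\..*$", "", levels(cell))

# Unweighted main effect of 'a': mean of the a = 2 cells minus
# mean of the a = 1 cells, each cell counted equally
w <- ifelse(a_of_cell == "2", 0.5, -0.5)

est     <- sum(w * coef(fit))                   # contrast estimate
var_est <- drop(t(w) %*% vcov(fit) %*% w)       # its estimated variance
f_stat  <- est^2 / var_est                      # F on 1 and residual d.f.
p_val   <- pf(f_stat, 1, df.residual(fit), lower.tail = FALSE)

c(estimate = est, F = f_stat, p = p_val)
```

For these data the estimate is 1 and the F ratio is 1.875 on 1 and 12 d.f.
Replacing w with weights proportional to the cell sizes within each level
of 'a' would give the weighted version of the effect instead.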

Christophe