[R] Lack of independence in anova()

Douglas Bates dmbates at gmail.com
Mon Jul 4 22:44:19 CEST 2005


I wrote up a note on how to do a more efficient simulation of the
p-values from anova, then discovered to my surprise that the chi-square
test for independence of the significance of the F-tests indicated
that they were not independent.  I was stumped by this but fortunately
Thomas Lumley came to my rescue with an explanation.  There is no
reason why the results of the F-tests should be independent.  The
numerators are independent but the denominator is the same for both
tests.  When, due to random variation, the denominator is small, the
p-values for both tests will tend to be small.  If, instead of
F-tests, you use chi-square tests based on the known residual
variance, then you do see independence.
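
To see the shared-denominator effect in isolation, here is a minimal
sketch (not from the original analysis; the degrees of freedom 2 and
12 match the anova below, the replication count is arbitrary) that
builds two F statistics from independent chi-square numerators and a
common chi-square denominator:

## Two F statistics sharing one denominator: the numerator mean
## squares are independent, but the common denominator induces
## positive correlation between the p-values.
set.seed(1)
num1 <- rchisq(10000, 2)/2    # first numerator mean square
num2 <- rchisq(10000, 2)/2    # second numerator mean square
den  <- rchisq(10000, 12)/12  # shared denominator (residual) mean square
p1 <- pf(num1/den, 2, 12, lower.tail = FALSE)
p2 <- pf(num2/den, 2, 12, lower.tail = FALSE)
cor(p1, p2)                   # noticeably positive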

Here is the note on the simulation.

There are several things that could be done to speed up the simulation
of the p-values of an anova like this under the null distribution.

If you examine the structure of a fitted lm object (use the function
str()) you will see that there are components called `qr', `effects'
and `assign'.  You can verify by experimentation that `qr' and
`assign' depend only on the experimental design.  Furthermore, the
`effects' vector can be reproduced as qr.qty(qrstr, y), where qrstr is
the `qr' component of the fit and y is the response.
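
As a quick sanity check (a sketch, not in the original note; it uses
the data frame vv constructed in the transcript below), you can
confirm the qr.qty() relationship on any fitted lm object:

## The effects component of an lm fit is Q'y, so applying qr.qty to
## the qr component and the response should reproduce it (names
## dropped, since qr.qty returns an unnamed vector)
y <- rnorm(18)
fit <- lm(y ~ c*r, vv)
all.equal(unname(qr.qty(fit$qr, y)), unname(fit$effects))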

The sums of squares for the different terms in the model are the sums
of squares of the elements of the effects vector as indexed by the
assign vector.  The residual sum of squares is the sum of squares of
the remaining elements of the effects vector.  You can generate 10000
replications of the calculation of the relevant sums of squares as

> set.seed(1234321)
> vv <- data.frame(c = gl(3,3,18), r = gl(2,9,18))
> vv
   c r
1  1 1
2  1 1
3  1 1
4  2 1
5  2 1
6  2 1
7  3 1
8  3 1
9  3 1
10 1 2
11 1 2
12 1 2
13 2 2
14 2 2
15 2 2
16 3 2
17 3 2
18 3 2
> fm1 <- lm(rnorm(18) ~ c*r, vv)
> fm1$assign
[1] 0 1 1 2 3 3
> asgn <- c(fm1$assign, rep(4, 12))
> system.time(res <- replicate(10000, tapply(qr.qty(fm1$qr, rnorm(18))^2, asgn, sum)))
[1] 20.61  0.01 20.61  0.00  0.00
> res[,1:6]
       [,1]      [,2]     [,3]      [,4]       [,5]        [,6]
0 0.4783121 0.3048634 0.713689 0.6937838 0.03649023  2.63392426
1 0.5825213 1.4756395 1.127018 0.5209751 1.18697199  3.32972093
2 0.2612723 3.6396106 0.547506 1.1641910 0.37843963  0.03411672
3 2.6259806 3.5504584 1.645215 0.1197238 0.85361018  4.53895212
4 9.1942755 8.2122693 4.863392 5.4413571 2.03715439 22.94815118

The rows of that array correspond to the sums of squares for the
Intercept (which we generally ignore), the factor 'c', the factor 'r',
their interaction, and the residuals.

As you can see, that took about 21 seconds on my system, which I
expect is a bit faster than your simulation.

Because I set the seed to a known value I can reproduce the results
for the first few simulations to check that the sums of squares are
correct.  Remember that the original fit (fm1) is not included in the
table, so fm1 is refit first to consume the same initial 18 random
numbers; the response for fm2 then corresponds to the first column of
res.

> set.seed(1234321)
> fm1 <- lm(rnorm(18) ~ c*r, vv)
> anova(fm2 <- lm(rnorm(18) ~ c*r, vv))
Analysis of Variance Table

Response: rnorm(18)
          Df Sum Sq Mean Sq F value Pr(>F)
c          2 0.5825  0.2913  0.3801 0.6917
r          1 0.2613  0.2613  0.3410 0.5701
c:r        2 2.6260  1.3130  1.7137 0.2215
Residuals 12 9.1943  0.7662               

You can continue this process if you wish further verification that
the results correspond to the fitted models.
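
For example, a sketch of the next check (assuming the same seed and
the same order of random-number generation as above):

## Each lm fit consumes 18 normals, so after refitting fm1 and fm2
## the next fit corresponds to the second simulation, i.e. res[, 2]
set.seed(1234321)
fm1 <- lm(rnorm(18) ~ c*r, vv)
fm2 <- lm(rnorm(18) ~ c*r, vv)
anova(fm3 <- lm(rnorm(18) ~ c*r, vv))   # compare with res[, 2]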

You can get the p-values for the F-tests as

> pvalsF <- data.frame(pc = pf((res[2,]/2)/(res[5,]/12), 2, 12, low = FALSE),
+                      pr = pf((res[3,]/1)/(res[5,]/12), 1, 12, low = FALSE),
+                      pint = pf((res[4,]/2)/(res[5,]/12), 2, 12, low = FALSE))

Again you can check this for the first few by hand.

> pvalsF[1:5,]
          pc         pr      pint
1 0.69171238 0.57006574 0.2214847
2 0.37102129 0.03975286 0.1158059
3 0.28634939 0.26771167 0.1740633
4 0.57775850 0.13506561 0.8775828
5 0.06363138 0.16124100 0.1224806
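
As a sketch of such a check (fm2 is the second fit from the
verification transcript above):

## Row 1 of pvalsF should match the Pr(>F) column of the anova table
## for fm2, the first simulated fit
anova(fm2)[["Pr(>F)"]][1:3]   # compare with pvalsF[1, ]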

If you wish, you could then check the marginal distributions using
techniques like an empirical density plot.

> library(lattice)
> densityplot(~ pc, pvalsF)
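
Under the null hypothesis each marginal p-value should be uniform on
(0, 1), so (as a sketch, not part of the original post) an empirical
CDF plotted against the diagonal works equally well:

## The empirical CDF of a null p-value should track the 45-degree line
plot(ecdf(pvalsF$pc))
abline(0, 1, lty = 2)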

At this point I would recommend checking the joint distribution, but
if you want to choose a specific level and check the contingency
table, that could be done as

> xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF)
            I(pint < 0.16)
I(pr < 0.16) FALSE TRUE
       FALSE  7204 1240
       TRUE   1215  341

The summary method for an xtabs object provides a test of independence

> summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF))
Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsF)
Number of cases in table: 10000 
Number of factors: 2 
Test for independence of all factors:
	Chisq = 51.6, df = 1, p-value = 6.798e-13

in which you can see the puzzling result: a highly significant lack
of independence.  However, if you use the chi-squared tests based on
the known residual variance of 1, you get

> pvalsC <- data.frame(pc = pchisq(res[2,], 2, low = FALSE),
+                      pr = pchisq(res[3,], 1, low = FALSE),
+                      pint = pchisq(res[4,], 2, low = FALSE))
> pvalsC[1:5,]
         pc         pr      pint
1 0.7473209 0.60924741 0.2690144
2 0.4781553 0.05642013 0.1694446
3 0.5692081 0.45933855 0.4392846
4 0.7706757 0.28059805 0.9418946
5 0.5523983 0.53843951 0.6525907
> xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC)
            I(pint < 0.16)
I(pr < 0.16) FALSE TRUE
       FALSE  7121 1319
       TRUE   1324  236
> summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC))
Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsC)
Number of cases in table: 10000 
Number of factors: 2 
Test for independence of all factors:
	Chisq = 0.25041, df = 1, p-value = 0.6168
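
There is nothing special about the cutoff of 0.16; as a sketch (these
alternative cutoffs are arbitrary, not from the original post), the
independence test can be repeated at several levels:

## Repeat the chi-square test of independence at a few arbitrary cutoffs
for (a in c(0.05, 0.10, 0.25))
    print(summary(xtabs(~ I(pr < a) + I(pint < a), pvalsC)))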

 

On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
> To my surprise, the R functions I employed did NOT verify the independence
> property.  I've no quarrel with the theory--it's the R functions that worry
> me.
> 
> pG
> 
> -----Original Message-----
> From: Douglas Bates [mailto:dmbates at gmail.com]
> Sent: Monday, July 04, 2005 9:13 AM
> To: pigood at verizon.net; r-help
> Subject: Re: [R] Lack of independence in anova()
> 
> 
> I have already had email exchanges off-list with Phillip Good pointing
> out that the independence property that he wishes to establish by
> simulation is a consequence of orthogonality of the column span of the
> row contrasts and the interaction contrasts.  If you set the contrasts
> option to a set of orthogonal contrasts such as contr.helmert or
> contr.sum, which has no effect on the results of the anova, this is
> easily established.
> 
> > build
> function(size, v = rnorm(sum(size))) {
>     col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
>     rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
>     row=c(rep(0,size[1]+size[2]+size[3]+size[4]),rep(1,size[5]+size[6]
>     +size[7]+size[8]))
>     return(data.frame(c=factor(col), r=factor(row),yield=v))
> }
> > size
> [1] 3 3 3 0 3 3 3 0
> > set.seed(1234321)
> > vv <- build(size)
> > vv
>    c r      yield
> 1  0 0  1.2369081
> 2  0 0  1.5616230
> 3  0 0  1.8396185
> 4  1 0  0.3173245
> 5  1 0  1.0715115
> 6  1 0 -1.1459955
> 7  2 0  0.2488894
> 8  2 0  0.1158625
> 9  2 0  2.6200816
> 10 0 1  1.2624048
> 11 0 1 -0.9862578
> 12 0 1 -0.3235653
> 13 1 1  0.2039706
> 14 1 1 -1.4574576
> 15 1 1  1.9158713
> 16 2 1 -2.0333909
> 17 2 1  1.0050272
> 18 2 1  0.6789184
> > options(contrasts = c('contr.helmert', 'contr.poly'))
> > crossprod(model.matrix(~c*r, vv))
>             (Intercept) c1 c2 r1 c1:r1 c2:r1
> (Intercept)          18  0  0  0     0     0
> c1                    0 12  0  0     0     0
> c2                    0  0 36  0     0     0
> r1                    0  0  0 18     0     0
> c1:r1                 0  0  0  0    12     0
> c2:r1                 0  0  0  0     0    36
> 
> 
> On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
> >  If the observations are normally distributed and the 2xk design is
> > balanced, theory requires that the tests for interaction and row effects
> > be independent.  In my program, appended below, this would translate to
> > cntT (approx) = cntR*cntI/N if all R routines were functioning
> > correctly.  They aren't.
> >
> > sim2=function(size,N,p){
> >   cntR=0
> >   cntC=0
> >   cntI=0
> >   cntT=0
> >   cntP=0
> >   for(i in 1:N){
> >     #generate data
> >      v=gendata(size)
> >     #analyze after build(ing) design containing data
> >      lm.out=lm(yield~c*r,build(size,v))
> >      av.out=anova(lm.out)
> >     #if column effect is significant, increment cntC
> >      if (av.out[[5]][1]<=p)cntC=cntC+1
> >    #if row effect is significant, increment cntR
> >      if (av.out[[5]][2]<=p){
> >            cntR=cntR+1
> >            tmp = 1
> >            }
> >      else tmp =0
> >      if (av.out[[5]][3]<=p){
> >            #if interaction is significant, increment cntI
> >             cntI=cntI+1
> >         #if both interaction and row effect are significant, increment cntT
> >             cntT=cntT + tmp
> >             }
> >      }
> >     list(cntC=cntC, cntR=cntR, cntI=cntI, cntT=cntT)
> > }
> >
> > build=function(size,v){
> > #size is a vector containing the sample sizes
> > col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
> > rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
> > row=c(rep(0,size[1]+size[2]+size[3]+size[4]),rep(1,size[5]+size[6]
> > +size[7]+size[8]))
> > return(data.frame(c=factor(col), r=factor(row),yield=v))
> > }
> >
> > gendata=function(size){
> >   ssize=sum(size);
> >   return (rnorm(ssize))
> > }
> >
> > #Example
> >  size=c(3,3,3,0,3,3,3,0)
> >  sim2(size,10000,.16)
> >
> >
> >
> > Phillip Good
> > Huntington Beach CA
> >