[R] Lack of independence in anova()

Thomas Lumley tlumley at u.washington.edu
Wed Jul 6 19:06:45 CEST 2005


On Wed, 6 Jul 2005, Phillip Good wrote:
> Do you or Lumley have a citation for this conclusion?  Most people go
> forward with the ANOVA on the basis that the various tests are independent.

I don't have a citation. I would have thought that the claim that they 
were independent was more in need of a citation.  I'd expect that books on 
the classical theory of linear models would show that the row, 
interaction, and residual mean squares are independent, and the lack of 
independence of the tests is then basic probability.  If X, Y, and Z are 
independent and Z takes on more than one value then X/Z and Y/Z can't be 
independent.
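
A quick numerical illustration of that point (just a sketch, with
chi-squared variables standing in for the mean squares):

  set.seed(1)
  X <- rchisq(1e5, df = 1)   # stands in for the row numerator
  Y <- rchisq(1e5, df = 2)   # stands in for the interaction numerator
  Z <- rchisq(1e5, df = 12)  # the shared residual denominator
  cor(X, Y)                  # near zero: the numerators are independent
  cor(X/Z, Y/Z)              # clearly positive: the ratios are not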

> P.S.  Tests based on the method of synchronized permutations are
> independent.

Quite possibly.  If so, they presumably condition on the observed residual 
mean square.

 	-thomas




>
> -----Original Message-----
> From: Douglas Bates [mailto:dmbates at gmail.com]
> Sent: Monday, July 04, 2005 1:44 PM
> To: pigood at verizon.net
> Cc: r-help
> Subject: Re: [R] Lack of independence in anova()
>
>
> I wrote up a note on how to do a more efficient simulation of the
> p-values from anova(), then discovered to my surprise that the chi-square
> test for independence of the significance of the F-tests indicated
> that they were not independent.  I was stumped by this, but fortunately
> Thomas Lumley came to my rescue with an explanation.  There is no
> reason why the results of the F-tests should be independent.  The
> numerators are independent, but the denominator is the same for both
> tests.  When, due to random variation, the denominator is small, the
> p-values for both tests will tend to be small.  If, instead of
> F-tests, you use chi-square tests, then you do see independence.
>
> Here is the note on the simulation.
>
> There are several things that could be done to speed the simulation of
> the p-values of an anova like this under the null distribution.
>
> If you examine the structure of a fitted lm object (use the function
> str()) you will see that there are components called `qr', `effects'
> and `assign'.  You can verify by experimentation that `qr' and
> `assign' depend only on the experimental design.  Furthermore, the
> `effects' vector can be reproduced as qr.qty(qrstr, y).
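
A quick way to check this claim (a sketch, on a small balanced two-factor
design):

  set.seed(1)
  dd <- data.frame(a = gl(3, 6), b = gl(2, 3, 18))
  y  <- rnorm(18)
  fm <- lm(y ~ a * b, dd)
  fm$assign                                        # depends only on the terms
  all.equal(unname(fm$effects), qr.qty(fm$qr, y))  # should be TRUE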
>
> The sums of squares for the different terms in the model are the sums
> of squares of elements of the effects vector as indexed by the assign
> vector.  The residual sum of squares is the sum of squares of the
> remaining elements of the effects vector.  You can generate 10000
> replications of the calculation of the relevant sums of squares as
>
>> set.seed(1234321)
>> vv <- data.frame(c = gl(3,3,18), r = gl(2,9,18))
>> vv
>   c r
> 1  1 1
> 2  1 1
> 3  1 1
> 4  2 1
> 5  2 1
> 6  2 1
> 7  3 1
> 8  3 1
> 9  3 1
> 10 1 2
> 11 1 2
> 12 1 2
> 13 2 2
> 14 2 2
> 15 2 2
> 16 3 2
> 17 3 2
> 18 3 2
>> fm1 <- lm(rnorm(18) ~ c*r, vv)
>> fm1$assign
> [1] 0 1 1 2 3 3
>> asgn <- c(fm1$assign, rep(4, 12))
>> system.time(res <- replicate(10000,
> +                    tapply(qr.qty(fm1$qr, rnorm(18))^2, asgn, sum)))
> [1] 20.61  0.01 20.61  0.00  0.00
>> res[,1:6]
>       [,1]      [,2]     [,3]      [,4]       [,5]        [,6]
> 0 0.4783121 0.3048634 0.713689 0.6937838 0.03649023  2.63392426
> 1 0.5825213 1.4756395 1.127018 0.5209751 1.18697199  3.32972093
> 2 0.2612723 3.6396106 0.547506 1.1641910 0.37843963  0.03411672
> 3 2.6259806 3.5504584 1.645215 0.1197238 0.85361018  4.53895212
> 4 9.1942755 8.2122693 4.863392 5.4413571 2.03715439 22.94815118
>
> The rows of that array correspond to the sum of squares for the
> Intercept (which we generally ignore), the factor 'c', the factor 'r',
> their interaction and the residuals.
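
If you want those labels on the rows themselves (an optional step, just a
sketch), you can set them; the numeric indexing used below is unaffected:

  rownames(res) <- c("(Intercept)", "c", "r", "c:r", "Residuals")
  res[, 1:3]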
>
> As you can see, that took about 21 seconds on my system, which I expect
> is a bit faster than your simulation.
>
> Because I set the seed to a known value I can reproduce the results
> for the first few simulations to check that the sums of squares are
> correct.  Remember that the original fit (fm1) is not included in the
> table.
>
>> set.seed(1234321)
>> fm1 <- lm(rnorm(18) ~ c*r, vv)
>> anova(fm2 <- lm(rnorm(18) ~ c*r, vv))
> Analysis of Variance Table
>
> Response: rnorm(18)
>          Df Sum Sq Mean Sq F value Pr(>F)
> c          2 0.5825  0.2913  0.3801 0.6917
> r          1 0.2613  0.2613  0.3410 0.5701
> c:r        2 2.6260  1.3130  1.7137 0.2215
> Residuals 12 9.1943  0.7662
>
> You can continue this process if you wish further verification that
> the results correspond to the fitted models.
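
For example (a sketch), the next fit drawn from the same random-number
stream should reproduce the second simulated column:

  anova(fm3 <- lm(rnorm(18) ~ c*r, vv))  # Sum Sq column should match res[2:5, 2]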
>
> You can get the p-values for the F-tests as
>
>> pvalsF <- data.frame(pc = pf((res[2,]/2)/(res[5,]/12), 2, 12, low = FALSE),
> +                      pr = pf((res[3,]/1)/(res[5,]/12), 1, 12, low = FALSE),
> +                      pint = pf((res[4,]/2)/(res[5,]/12), 2, 12, low = FALSE))
>
> Again you can check this for the first few by hand.
>
>> pvalsF[1:5,]
>          pc         pr      pint
> 1 0.69171238 0.57006574 0.2214847
> 2 0.37102129 0.03975286 0.1158059
> 3 0.28634939 0.26771167 0.1740633
> 4 0.57775850 0.13506561 0.8775828
> 5 0.06363138 0.16124100 0.1224806
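
For instance (a sketch), the first row should agree with the p-values in
the anova table for fm2 shown above:

  anova(fm2)[["Pr(>F)"]][1:3]   # c, r, c:r -> 0.6917, 0.5701, 0.2215
  pvalsF[1, ]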
>
> If you wish you could then check marginal distributions using
> techniques like an empirical density plot.
>
>> library(lattice)
>> densityplot(~ pc, pvalsF)
>
> At this point I would recommend checking the joint distribution, but if
> you want to choose a specific level and check the contingency table,
> that could be done as
>
>> xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF)
>            I(pint < 0.16)
> I(pr < 0.16) FALSE TRUE
>       FALSE  7204 1240
>       TRUE   1215  341
>
> The summary method for an xtabs object provides a test of independence
>
>> summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF))
> Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsF)
> Number of cases in table: 10000
> Number of factors: 2
> Test for independence of all factors:
> 	Chisq = 51.6, df = 1, p-value = 6.798e-13
>
> for which you can see the puzzling result.  However, if you use the
> chi-squared test based on the known residual variance of 1, you get
>
>> pvalsC <- data.frame(pc = pchisq(res[2,], 2, low = FALSE),
> +                      pr = pchisq(res[3,], 1, low = FALSE),
> +                      pint = pchisq(res[4,], 2, low = FALSE))
>> pvalsC[1:5,]
>         pc         pr      pint
> 1 0.7473209 0.60924741 0.2690144
> 2 0.4781553 0.05642013 0.1694446
> 3 0.5692081 0.45933855 0.4392846
> 4 0.7706757 0.28059805 0.9418946
> 5 0.5523983 0.53843951 0.6525907
>> xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC)
>            I(pint < 0.16)
> I(pr < 0.16) FALSE TRUE
>       FALSE  7121 1319
>       TRUE   1324  236
>> summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC))
> Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsC)
> Number of cases in table: 10000
> Number of factors: 2
> Test for independence of all factors:
> 	Chisq = 0.25041, df = 1, p-value = 0.6168
>
>
>
> On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
>> To my surprise, the R functions I employed did NOT verify the independence
>> property.  I've no quarrel with the theory--it's the R functions that
>> worry me.
>>
>> pG
>>
>> -----Original Message-----
>> From: Douglas Bates [mailto:dmbates at gmail.com]
>> Sent: Monday, July 04, 2005 9:13 AM
>> To: pigood at verizon.net; r-help
>> Subject: Re: [R] Lack of independence in anova()
>>
>>
>> I have already had email exchanges off-list with Phillip Good pointing
>> out that the independence property that he wishes to establish by
>> simulation is a consequence of orthogonality of the column span of the
>> row contrasts and the interaction contrasts.  If you set the contrasts
>> option to a set of orthogonal contrasts such as contr.helmert or
>> contr.sum, which has no effect on the results of the anova, this is
>> easily established.
>>
>>> build
>> function(size, v = rnorm(sum(size))) {
>>     col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
>>     rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
>>     row=c(rep(0,size[1]+size[2]+size[3]+size[4]),rep(1,size[5]+size[6]
>>     +size[7]+size[8]))
>>     return(data.frame(c=factor(col), r=factor(row),yield=v))
>> }
>>> size
>> [1] 3 3 3 0 3 3 3 0
>>> set.seed(1234321)
>>> vv <- build(size)
>>> vv
>>    c r      yield
>> 1  0 0  1.2369081
>> 2  0 0  1.5616230
>> 3  0 0  1.8396185
>> 4  1 0  0.3173245
>> 5  1 0  1.0715115
>> 6  1 0 -1.1459955
>> 7  2 0  0.2488894
>> 8  2 0  0.1158625
>> 9  2 0  2.6200816
>> 10 0 1  1.2624048
>> 11 0 1 -0.9862578
>> 12 0 1 -0.3235653
>> 13 1 1  0.2039706
>> 14 1 1 -1.4574576
>> 15 1 1  1.9158713
>> 16 2 1 -2.0333909
>> 17 2 1  1.0050272
>> 18 2 1  0.6789184
>>> options(contrasts = c('contr.helmert', 'contr.poly'))
>>> crossprod(model.matrix(~c*r, vv))
>>             (Intercept) c1 c2 r1 c1:r1 c2:r1
>> (Intercept)          18  0  0  0     0     0
>> c1                    0 12  0  0     0     0
>> c2                    0  0 36  0     0     0
>> r1                    0  0  0 18     0     0
>> c1:r1                 0  0  0  0    12     0
>> c2:r1                 0  0  0  0     0    36
>>
>>
>> On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
>>> If the observations are normally distributed and the 2xk design is
>>> balanced, theory requires that the tests for interaction and row effects
>>> be independent.  In my program, appended below, this would translate to
>>> cntT (approx) = cntR*cntI/N if all R routines were functioning correctly.
>>> They aren't.
>>>
>>> sim2=function(size,N,p){
>>>   cntR=0
>>>   cntC=0
>>>   cntI=0
>>>   cntT=0
>>>   cntP=0
>>>   for(i in 1:N){
>>>     #generate data
>>>      v=gendata(size)
>>>     #analyze after build(ing) design containing data
>>>      lm.out=lm(yield~c*r,build(size,v))
>>>      av.out=anova(lm.out)
>>>     #if column effect is significant, increment cntC
>>>      if (av.out[[5]][1]<=p)cntC=cntC+1
>>>    #if row effect is significant, increment cntR
>>>      if (av.out[[5]][2]<=p){
>>>            cntR=cntR+1
>>>            tmp = 1
>>>            }
>>>      else tmp =0
>>>      if (av.out[[5]][3]<=p){
>>>            #if interaction is significant, increment cntI
>>>             cntI=cntI+1
>>>         #if both interaction and row effect are significant, increment cntT
>>>             cntT=cntT + tmp
>>>             }
>>>      }
>>>     list(cntC=cntC, cntR=cntR, cntI=cntI, cntT=cntT)
>>> }
>>>
>>> build=function(size,v){
>>> #size is a vector containing the sample sizes
>>> col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
>>> rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
>>> row=c(rep(0,size[1]+size[2]+size[3]+size[4]),rep(1,size[5]+size[6]
>>> +size[7]+size[8]))
>>> return(data.frame(c=factor(col), r=factor(row),yield=v))
>>> }
>>>
>>> gendata=function(size){
>>>   ssize=sum(size);
>>>   return (rnorm(ssize))
>>> }
>>>
>>> #Example
>>>  size=c(3,3,3,0,3,3,3,0)
>>>  sim2(size,10000,.16)
>>>
>>>
>>>
>>> Phillip Good
>>> Huntington Beach CA
>>>
>>>
>>>
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



