[R] Lack of independence in anova()
Spencer Graves
spencer.graves at pdf.com
Wed Jul 6 19:08:29 CEST 2005
I'm confused: I understood Doug to be describing a traditional,
normal-theory ANOVA with k rows in the table for different effects plus
a (k+1)st for residuals, in which the entries of the mean squares (MS)
column are all independent chi-squares scaled by the same unknown
sigma^2. The k statistics in the F column are ratios of those
independent chi-squares to the same denominator chi-square. How can
they be independent?
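
A minimal numerical sketch of the point (mine, not Doug's; the degrees
of freedom match his example below, where the effects have 2, 1, and 2
df against 12 residual df):

set.seed(54321)                      # any seed will do
num1 <- rchisq(10000, df = 2)/2      # MS for one effect, sigma^2 = 1
num2 <- rchisq(10000, df = 1)/1      # MS for an independent effect
den  <- rchisq(10000, df = 12)/12    # the shared residual MS
cor(num1/den, num2/den)              # clearly positive, not near zero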
spencer graves
P.S. How is the method of synchronized permutations relevant to a
traditional, normal-theory ANOVA?
Phillip Good wrote:
> Do you or Lumley have a citation for this conclusion? Most people go
> forward with the ANOVA on the basis that the various tests are independent.
>
> Phillip Good
> P.S. Tests based on the method of synchronized permutations are
> independent.
>
> -----Original Message-----
> From: Douglas Bates [mailto:dmbates at gmail.com]
> Sent: Monday, July 04, 2005 1:44 PM
> To: pigood at verizon.net
> Cc: r-help
> Subject: Re: [R] Lack of independence in anova()
>
>
> I wrote up a note on how to do a more efficient simulation of the
> p-values from anova(), then discovered to my surprise that the
> chi-square test for independence of the significance of the F-tests
> indicated that they were not independent. I was stumped by this, but
> fortunately Thomas Lumley came to my rescue with an explanation. There
> is no reason why the results of the F tests should be independent: the
> numerators are independent, but the denominator is the same for both
> tests. When, due to random variation, the denominator is small, the
> p-values for both tests will tend to be small. If, instead of F-tests,
> you use chi-square tests, then you do see independence.
>
> Here is the note on the simulation.
>
> There are several things that could be done to speed the simulation of
> the p-values of an anova like this under the null distribution.
>
> If you examine the structure of a fitted lm object (use the function
> str()) you will see that there are components called `qr', `effects'
> and `assign'. You can verify by experimentation that `qr' and
> `assign' depend only on the experimental design. Furthermore, the
> `effects' vector can be reproduced as qr.qty(qrstr, y).
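>
> For instance, a minimal sketch of that check (using the same design
> that is constructed below):
>
> vv <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18))
> y <- rnorm(18)
> fm <- lm(y ~ c*r, vv)
> all.equal(as.vector(qr.qty(fm$qr, y)), as.vector(fm$effects))  # TRUE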
>
> The sums of squares for the different terms in the model are the sums
> of squares of the elements of the effects vector as indexed by the
> assign vector. The residual sum of squares is the sum of squares of
> the remaining elements of the effects vector. You can generate 10000
> replications of the calculations of the relevant sums of squares as
>
>
>>set.seed(1234321)
>>vv <- data.frame(c = gl(3,3,18), r = gl(2,9,18))
>>vv
>
> c r
> 1 1 1
> 2 1 1
> 3 1 1
> 4 2 1
> 5 2 1
> 6 2 1
> 7 3 1
> 8 3 1
> 9 3 1
> 10 1 2
> 11 1 2
> 12 1 2
> 13 2 2
> 14 2 2
> 15 2 2
> 16 3 2
> 17 3 2
> 18 3 2
>
>>fm1 <- lm(rnorm(18) ~ c*r, vv)
>>fm1$assign
>
> [1] 0 1 1 2 3 3
>
>>asgn <- c(fm1$assign, rep(4, 12))
>>system.time(res <- replicate(10000, tapply(qr.qty(fm1$qr, rnorm(18))^2, asgn, sum)))
>
> [1] 20.61 0.01 20.61 0.00 0.00
>
>>res[,1:6]
>
> [,1] [,2] [,3] [,4] [,5] [,6]
> 0 0.4783121 0.3048634 0.713689 0.6937838 0.03649023 2.63392426
> 1 0.5825213 1.4756395 1.127018 0.5209751 1.18697199 3.32972093
> 2 0.2612723 3.6396106 0.547506 1.1641910 0.37843963 0.03411672
> 3 2.6259806 3.5504584 1.645215 0.1197238 0.85361018 4.53895212
> 4 9.1942755 8.2122693 4.863392 5.4413571 2.03715439 22.94815118
>
> The rows of that array correspond to the sum of squares for the
> Intercept (which we generally ignore), the factor 'c', the factor 'r',
> their interaction and the residuals.
>
> As you can see, that took about 21 seconds on my system, which I expect
> is a bit faster than your simulation.
>
> Because I set the seed to a known value, I can reproduce the results
> for the first few simulations to check that the sums of squares are
> correct. Remember that the original fit (fm1) is not included in the
> table, so it must be regenerated first to keep the random-number
> stream in step.
>
>
>>set.seed(1234321)
>>fm1 <- lm(rnorm(18) ~ c*r, vv)
>>anova(fm2 <- lm(rnorm(18) ~ c*r, vv))
>
> Analysis of Variance Table
>
> Response: rnorm(18)
> Df Sum Sq Mean Sq F value Pr(>F)
> c 2 0.5825 0.2913 0.3801 0.6917
> r 1 0.2613 0.2613 0.3410 0.5701
> c:r 2 2.6260 1.3130 1.7137 0.2215
> Residuals 12 9.1943 0.7662
>
> You can continue this process if you wish further verification that
> the results correspond to the fitted models.
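>
> For example, one more refit (a sketch) consumes the next 18 random
> normals and should reproduce the second column of res:
>
>
>>anova(fm3 <- lm(rnorm(18) ~ c*r, vv))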
>
> You can get the p-values for the F-tests as
>
>
>>pvalsF <- data.frame(pc = pf((res[2,]/2)/(res[5,]/12), 2, 12, low = FALSE),
> +                    pr = pf((res[3,]/1)/(res[5,]/12), 1, 12, low = FALSE),
> +                    pint = pf((res[4,]/2)/(res[5,]/12), 2, 12, low = FALSE))
>
> Again you can check this for the first few by hand.
>
>
>>pvalsF[1:5,]
>
> pc pr pint
> 1 0.69171238 0.57006574 0.2214847
> 2 0.37102129 0.03975286 0.1158059
> 3 0.28634939 0.26771167 0.1740633
> 4 0.57775850 0.13506561 0.8775828
> 5 0.06363138 0.16124100 0.1224806
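>
> For instance, taking the F value for `c' from the anova table above
> (a quick hand check):
>
>
>>pf(0.3801, 2, 12, lower.tail = FALSE)  # ~0.6917, the Pr(>F) shown for c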
>
> If you wish you could then check marginal distributions using
> techniques like an empirical density plot.
>
>
>>library(lattice)
>>densityplot(~ pc, pvalsF)
>
>
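> A quick numerical check of uniformity (a sketch, not in the original
> post) is also possible:
>
>>ks.test(pvalsF$pc, "punif")
>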
> At this point I would recommend checking the joint distribution, but
> if you want to choose a specific level and check the contingency
> table, that could be done as
>
>
>>xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF)
>
> I(pint < 0.16)
> I(pr < 0.16) FALSE TRUE
> FALSE 7204 1240
> TRUE 1215 341
>
> The summary method for an xtabs object provides a test of independence
>
>
>>summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF))
>
> Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsF)
> Number of cases in table: 10000
> Number of factors: 2
> Test for independence of all factors:
> Chisq = 51.6, df = 1, p-value = 6.798e-13
>
> for which you can see the puzzling result. However, if you use the
> chi-squared test based on the known residual variance of 1, you get
>
>
>>pvalsC <- data.frame(pc = pchisq(res[2,], 2, low = FALSE),
> +                    pr = pchisq(res[3,], 1, low = FALSE),
> +                    pint = pchisq(res[4,], 2, low = FALSE))
>
>>pvalsC[1:5,]
>
> pc pr pint
> 1 0.7473209 0.60924741 0.2690144
> 2 0.4781553 0.05642013 0.1694446
> 3 0.5692081 0.45933855 0.4392846
> 4 0.7706757 0.28059805 0.9418946
> 5 0.5523983 0.53843951 0.6525907
>
>>xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC)
>
> I(pint < 0.16)
> I(pr < 0.16) FALSE TRUE
> FALSE 7121 1319
> TRUE 1324 236
>
>>summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC))
>
> Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsC)
> Number of cases in table: 10000
> Number of factors: 2
> Test for independence of all factors:
> Chisq = 0.25041, df = 1, p-value = 0.6168
>
>
>
> On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
>
>>To my surprise, the R functions I employed did NOT verify the independence
>>property. I've no quarrel with the theory--it's the R functions that
>>worry me.
>>
>>pG
>>
>>-----Original Message-----
>>From: Douglas Bates [mailto:dmbates at gmail.com]
>>Sent: Monday, July 04, 2005 9:13 AM
>>To: pigood at verizon.net; r-help
>>Subject: Re: [R] Lack of independence in anova()
>>
>>
>>I have already had email exchanges off-list with Phillip Good pointing
>>out that the independence property that he wishes to establish by
>>simulation is a consequence of the orthogonality of the column span of
>>the row contrasts and the column span of the interaction contrasts. If
>>you set the contrasts option to a set of orthogonal contrasts such as
>>contr.helmert or contr.sum (which has no effect on the results of the
>>anova), this is easily established.
>>
>>
>>>build
>>
>>function(size, v = rnorm(sum(size))) {
>>  col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
>>        rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
>>  row=c(rep(0,size[1]+size[2]+size[3]+size[4]),
>>        rep(1,size[5]+size[6]+size[7]+size[8]))
>>  return(data.frame(c=factor(col), r=factor(row), yield=v))
>>}
>>
>>>size
>>
>>[1] 3 3 3 0 3 3 3 0
>>
>>>set.seed(1234321)
>>>vv <- build(size)
>>>vv
>>
>> c r yield
>>1 0 0 1.2369081
>>2 0 0 1.5616230
>>3 0 0 1.8396185
>>4 1 0 0.3173245
>>5 1 0 1.0715115
>>6 1 0 -1.1459955
>>7 2 0 0.2488894
>>8 2 0 0.1158625
>>9 2 0 2.6200816
>>10 0 1 1.2624048
>>11 0 1 -0.9862578
>>12 0 1 -0.3235653
>>13 1 1 0.2039706
>>14 1 1 -1.4574576
>>15 1 1 1.9158713
>>16 2 1 -2.0333909
>>17 2 1 1.0050272
>>18 2 1 0.6789184
>>
>>>options(contrasts = c('contr.helmert', 'contr.poly'))
>>>crossprod(model.matrix(~c*r, vv))
>>
>> (Intercept) c1 c2 r1 c1:r1 c2:r1
>>(Intercept) 18 0 0 0 0 0
>>c1 0 12 0 0 0 0
>>c2 0 0 36 0 0 0
>>r1 0 0 0 18 0 0
>>c1:r1 0 0 0 0 12 0
>>c2:r1 0 0 0 0 0 36
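>>
>>For comparison (a sketch, not in the original message): with the
>>default treatment contrasts the cross-product matrix is no longer
>>diagonal, which is why the orthogonality is easier to see with
>>contr.helmert or contr.sum:
>>
>>
>>>options(contrasts = c('contr.treatment', 'contr.poly'))
>>>crossprod(model.matrix(~c*r, vv))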
>>
>>
>>On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
>>
>>> If the observations are normally distributed and the 2xk design is
>>>balanced, theory requires that the tests for interaction and row
>>>effects be independent. In my program, appended below, this would
>>>translate to cntT (approx.) = cntR*cntI/N if all R routines were
>>>functioning correctly. They aren't.
>>>
>>>sim2=function(size,N,p){
>>> cntR=0
>>> cntC=0
>>> cntI=0
>>> cntT=0
>>> for(i in 1:N){
>>> #generate data
>>> v=gendata(size)
>>> #analyze after build(ing) design containing data
>>> lm.out=lm(yield~c*r,build(size,v))
>>> av.out=anova(lm.out)
>>> #if column effect is significant, increment cntC
>>> if (av.out[[5]][1]<=p)cntC=cntC+1
>>> #if row effect is significant, increment cntR
>>> if (av.out[[5]][2]<=p){
>>> cntR=cntR+1
>>> tmp = 1
>>> }
>>> else tmp =0
>>> if (av.out[[5]][3]<=p){
>>> #if interaction is significant, increment cntI
>>> cntI=cntI+1
>>> #if both interaction and row effect are significant, increment cntT
>>> cntT=cntT + tmp
>>> }
>>> }
>>> list(cntC=cntC, cntR=cntR, cntI=cntI, cntT=cntT)
>>>}
>>>
>>>build=function(size,v){
>>> #size is a vector containing the sample sizes
>>> col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
>>>       rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
>>> row=c(rep(0,size[1]+size[2]+size[3]+size[4]),
>>>       rep(1,size[5]+size[6]+size[7]+size[8]))
>>> return(data.frame(c=factor(col), r=factor(row), yield=v))
>>>}
>>>
>>>gendata=function(size){
>>> ssize=sum(size);
>>> return (rnorm(ssize))
>>>}
>>>
>>>#Example
>>> size=c(3,3,3,0,3,3,3,0)
>>> sim2(size,10000,.16)
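>>>
>>>#A direct check of the independence claim on the returned counts (a
>>>#sketch, not in the original post): under independence cntT should be
>>>#close to cntR*cntI/N.
>>> out=sim2(size,10000,.16)
>>> c(observed=out$cntT, expected=out$cntR*out$cntI/10000)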
>>>
>>>
>>>
>>>Phillip Good
>>>Huntington Beach CA
>>>
>>>
>>>
--
Spencer Graves, PhD
Senior Development Engineer
PDF Solutions, Inc.
333 West San Carlos Street Suite 700
San Jose, CA 95110, USA
spencer.graves at pdf.com
www.pdf.com
Tel: 408-938-4420
Fax: 408-280-7915