[R] Lack of independence in anova()
Spencer Graves
spencer.graves at pdf.com
Wed Jul 6 19:08:29 CEST 2005
I'm confused: I understood Doug to be describing a traditional,
normal-theory ANOVA with k rows in the table for different effects plus
a (k+1)st for residuals, in which the entries of the mean squares (MS)
column are all independent chi-squares scaled by the same unknown
sigma^2. The k statistics in the F column are ratios of those
independent chi-squares to the same denominator chi-square. How can
they be independent?
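
A minimal numerical sketch of the point (mine, not Doug's; the degrees
of freedom match his example below, where the effects have 2, 1, and 2
df against 12 residual df):

set.seed(54321)                      # any seed will do
num1 <- rchisq(10000, df = 2)/2      # MS for one effect, sigma^2 = 1
num2 <- rchisq(10000, df = 1)/1      # MS for an independent effect
den  <- rchisq(10000, df = 12)/12    # the shared residual MS
cor(num1/den, num2/den)              # clearly positive, not near zero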
spencer graves
P.S. How is the method of synchronized permutations relevant to a
traditional, normal-theory ANOVA?
Phillip Good wrote:
> Do you or Lumley have a citation for this conclusion? Most people go
> forward with the ANOVA on the basis that the various tests are independent.
>
> Phillip Good
> P.S. Tests based on the method of synchronized permutations are
> independent.
>
> -----Original Message-----
> From: Douglas Bates [mailto:dmbates at gmail.com]
> Sent: Monday, July 04, 2005 1:44 PM
> To: pigood at verizon.net
> Cc: r-help
> Subject: Re: [R] Lack of independence in anova()
>
>
> I wrote up a note on how to do a more efficient simulation of the
> p-values from anova(), then discovered to my surprise that the
> chi-square test for independence of the significance of the F-tests
> indicated that they were not independent. I was stumped by this, but
> fortunately Thomas Lumley came to my rescue with an explanation. There
> is no reason why the results of the F tests should be independent: the
> numerators are independent, but the denominator is the same for both
> tests. When, due to random variation, the denominator is small, the
> p-values for both tests will tend to be small. If, instead of F-tests,
> you use chi-square tests, then you do see independence.
>
> Here is the note on the simulation.
>
> There are several things that could be done to speed the simulation of
> the p-values of an anova like this under the null distribution.
>
> If you examine the structure of a fitted lm object (use the function
> str()) you will see that there are components called `qr', `effects'
> and `assign'. You can verify by experimentation that `qr' and
> `assign' depend only on the experimental design. Furthermore, the
> `effects' vector can be reproduced as qr.qty(qrstr, y).
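>
> For instance, a minimal sketch of that check (using the same design
> that is constructed below):
>
> vv <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18))
> y <- rnorm(18)
> fm <- lm(y ~ c*r, vv)
> all.equal(as.vector(qr.qty(fm$qr, y)), as.vector(fm$effects))  # TRUE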
>
> The sums of squares for the different terms in the model are the sums
> of squares of the elements of the effects vector as indexed by the
> assign vector. The residual sum of squares is the sum of squares of
> the remaining elements of the effects vector. You can generate 10000
> replications of the calculations of the relevant sums of squares as
>
>
>>set.seed(1234321)
>>vv <- data.frame(c = gl(3,3,18), r = gl(2,9,18))
>>vv
>
> c r
> 1 1 1
> 2 1 1
> 3 1 1
> 4 2 1
> 5 2 1
> 6 2 1
> 7 3 1
> 8 3 1
> 9 3 1
> 10 1 2
> 11 1 2
> 12 1 2
> 13 2 2
> 14 2 2
> 15 2 2
> 16 3 2
> 17 3 2
> 18 3 2
>
>>fm1 <- lm(rnorm(18) ~ c*r, vv)
>>fm1$assign
>
> [1] 0 1 1 2 3 3
>
>>asgn <- c(fm1$assign, rep(4, 12))
>>system.time(res <- replicate(10000, tapply(qr.qty(fm1$qr, rnorm(18))^2, asgn, sum)))
>
> [1] 20.61 0.01 20.61 0.00 0.00
>
>>res[,1:6]
>
> [,1] [,2] [,3] [,4] [,5] [,6]
> 0 0.4783121 0.3048634 0.713689 0.6937838 0.03649023 2.63392426
> 1 0.5825213 1.4756395 1.127018 0.5209751 1.18697199 3.32972093
> 2 0.2612723 3.6396106 0.547506 1.1641910 0.37843963 0.03411672
> 3 2.6259806 3.5504584 1.645215 0.1197238 0.85361018 4.53895212
> 4 9.1942755 8.2122693 4.863392 5.4413571 2.03715439 22.94815118
>
> The rows of that array correspond to the sum of squares for the
> Intercept (which we generally ignore), the factor 'c', the factor 'r',
> their interaction and the residuals.
>
> As you can see, that took about 21 seconds on my system, which I expect
> is a bit faster than your simulation.
>
> Because I set the seed to a known value, I can reproduce the results
> for the first few simulations to check that the sums of squares are
> correct. Remember that the original fit (fm1) is not included in the
> table, so it must be regenerated first to keep the random-number
> stream in step.
>
>
>>set.seed(1234321)
>>fm1 <- lm(rnorm(18) ~ c*r, vv)
>>anova(fm2 <- lm(rnorm(18) ~ c*r, vv))
>
> Analysis of Variance Table
>
> Response: rnorm(18)
> Df Sum Sq Mean Sq F value Pr(>F)
> c 2 0.5825 0.2913 0.3801 0.6917
> r 1 0.2613 0.2613 0.3410 0.5701
> c:r 2 2.6260 1.3130 1.7137 0.2215
> Residuals 12 9.1943 0.7662
>
> You can continue this process if you wish further verification that
> the results correspond to the fitted models.
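>
> For example, one more refit (a sketch) consumes the next 18 random
> normals and should reproduce the second column of res:
>
>
>>anova(fm3 <- lm(rnorm(18) ~ c*r, vv))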
>
> You can get the p-values for the F-tests as
>
>
>>pvalsF <- data.frame(pc = pf((res[2,]/2)/(res[5,]/12), 2, 12, low = FALSE),
> +                    pr = pf((res[3,]/1)/(res[5,]/12), 1, 12, low = FALSE),
> +                    pint = pf((res[4,]/2)/(res[5,]/12), 2, 12, low = FALSE))
>
> Again you can check this for the first few by hand.
>
>
>>pvalsF[1:5,]
>
> pc pr pint
> 1 0.69171238 0.57006574 0.2214847
> 2 0.37102129 0.03975286 0.1158059
> 3 0.28634939 0.26771167 0.1740633
> 4 0.57775850 0.13506561 0.8775828
> 5 0.06363138 0.16124100 0.1224806
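>
> For instance, taking the F value for `c' from the anova table above
> (a quick hand check):
>
>
>>pf(0.3801, 2, 12, lower.tail = FALSE)  # ~0.6917, the Pr(>F) shown for c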
>
> If you wish you could then check marginal distributions using
> techniques like an empirical density plot.
>
>
>>library(lattice)
>>densityplot(~ pc, pvalsF)
>
>
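> A quick numerical check of uniformity (a sketch, not in the original
> post) is also possible:
>
>>ks.test(pvalsF$pc, "punif")
>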
> At this point I would recommend checking the joint distribution, but
> if you want to choose a specific level and check the contingency
> table, that could be done as
>
>
>>xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF)
>
> I(pint < 0.16)
> I(pr < 0.16) FALSE TRUE
> FALSE 7204 1240
> TRUE 1215 341
>
> The summary method for an xtabs object provides a test of independence
>
>
>>summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF))
>
> Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsF)
> Number of cases in table: 10000
> Number of factors: 2
> Test for independence of all factors:
> Chisq = 51.6, df = 1, p-value = 6.798e-13
>
> for which you can see the puzzling result. However, if you use the
> chi-squared test based on the known residual variance of 1, you get
>
>
>>pvalsC <- data.frame(pc = pchisq(res[2,], 2, low = FALSE),
> +                    pr = pchisq(res[3,], 1, low = FALSE),
> +                    pint = pchisq(res[4,], 2, low = FALSE))
>
>>pvalsC[1:5,]
>
> pc pr pint
> 1 0.7473209 0.60924741 0.2690144
> 2 0.4781553 0.05642013 0.1694446
> 3 0.5692081 0.45933855 0.4392846
> 4 0.7706757 0.28059805 0.9418946
> 5 0.5523983 0.53843951 0.6525907
>
>>xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC)
>
> I(pint < 0.16)
> I(pr < 0.16) FALSE TRUE
> FALSE 7121 1319
> TRUE 1324 236
>
>>summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC))
>
> Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsC)
> Number of cases in table: 10000
> Number of factors: 2
> Test for independence of all factors:
> Chisq = 0.25041, df = 1, p-value = 0.6168
>
>
>
> On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
>
>>To my surprise, the R functions I employed did NOT verify the independence
>>property. I've no quarrel with the theory--it's the R functions that
>>worry me.
>>
>>pG
>>
>>-----Original Message-----
>>From: Douglas Bates [mailto:dmbates at gmail.com]
>>Sent: Monday, July 04, 2005 9:13 AM
>>To: pigood at verizon.net; r-help
>>Subject: Re: [R] Lack of independence in anova()
>>
>>
>>I have already had email exchanges off-list with Phillip Good pointing
>>out that the independence property that he wishes to establish by
>>simulation is a consequence of the orthogonality of the column span of
>>the row contrasts and the column span of the interaction contrasts. If
>>you set the contrasts option to a set of orthogonal contrasts such as
>>contr.helmert or contr.sum (which has no effect on the results of the
>>anova), this is easily established.
>>
>>
>>>build
>>
>>function(size, v = rnorm(sum(size))) {
>>  col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
>>        rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
>>  row=c(rep(0,size[1]+size[2]+size[3]+size[4]),
>>        rep(1,size[5]+size[6]+size[7]+size[8]))
>>  return(data.frame(c=factor(col), r=factor(row), yield=v))
>>}
>>
>>>size
>>
>>[1] 3 3 3 0 3 3 3 0
>>
>>>set.seed(1234321)
>>>vv <- build(size)
>>>vv
>>
>> c r yield
>>1 0 0 1.2369081
>>2 0 0 1.5616230
>>3 0 0 1.8396185
>>4 1 0 0.3173245
>>5 1 0 1.0715115
>>6 1 0 -1.1459955
>>7 2 0 0.2488894
>>8 2 0 0.1158625
>>9 2 0 2.6200816
>>10 0 1 1.2624048
>>11 0 1 -0.9862578
>>12 0 1 -0.3235653
>>13 1 1 0.2039706
>>14 1 1 -1.4574576
>>15 1 1 1.9158713
>>16 2 1 -2.0333909
>>17 2 1 1.0050272
>>18 2 1 0.6789184
>>
>>>options(contrasts = c('contr.helmert', 'contr.poly'))
>>>crossprod(model.matrix(~c*r, vv))
>>
>> (Intercept) c1 c2 r1 c1:r1 c2:r1
>>(Intercept) 18 0 0 0 0 0
>>c1 0 12 0 0 0 0
>>c2 0 0 36 0 0 0
>>r1 0 0 0 18 0 0
>>c1:r1 0 0 0 0 12 0
>>c2:r1 0 0 0 0 0 36
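>>
>>For comparison (a sketch, not in the original message): with the
>>default treatment contrasts the cross-product matrix is no longer
>>diagonal, which is why the orthogonality is easier to see with
>>contr.helmert or contr.sum:
>>
>>
>>>options(contrasts = c('contr.treatment', 'contr.poly'))
>>>crossprod(model.matrix(~c*r, vv))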
>>
>>
>>On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
>>
>>> If the observations are normally distributed and the 2xk design is
>>>balanced, theory requires that the tests for interaction and row
>>>effects be independent. In my program, appended below, this would
>>>translate to cntT (approx.) = cntR*cntI/N if all R routines were
>>>functioning correctly. They aren't.
>>>
>>>sim2=function(size,N,p){
>>> cntR=0
>>> cntC=0
>>> cntI=0
>>> cntT=0
>>> for(i in 1:N){
>>> #generate data
>>> v=gendata(size)
>>> #analyze after build(ing) design containing data
>>> lm.out=lm(yield~c*r,build(size,v))
>>> av.out=anova(lm.out)
>>> #if column effect is significant, increment cntC
>>> if (av.out[[5]][1]<=p)cntC=cntC+1
>>> #if row effect is significant, increment cntR
>>> if (av.out[[5]][2]<=p){
>>> cntR=cntR+1
>>> tmp = 1
>>> }
>>> else tmp =0
>>> if (av.out[[5]][3]<=p){
>>> #if interaction is significant, increment cntI
>>> cntI=cntI+1
>>> #if both interaction and row effect are significant, increment cntT
>>> cntT=cntT + tmp
>>> }
>>> }
>>> list(cntC=cntC, cntR=cntR, cntI=cntI, cntT=cntT)
>>>}
>>>
>>>build=function(size,v){
>>> #size is a vector containing the sample sizes
>>> col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
>>>       rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
>>> row=c(rep(0,size[1]+size[2]+size[3]+size[4]),
>>>       rep(1,size[5]+size[6]+size[7]+size[8]))
>>> return(data.frame(c=factor(col), r=factor(row), yield=v))
>>>}
>>>
>>>gendata=function(size){
>>> ssize=sum(size);
>>> return (rnorm(ssize))
>>>}
>>>
>>>#Example
>>> size=c(3,3,3,0,3,3,3,0)
>>> sim2(size,10000,.16)
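>>>
>>>#A direct check of the independence claim on the returned counts (a
>>>#sketch, not in the original post): under independence cntT should be
>>>#close to cntR*cntI/N.
>>> out=sim2(size,10000,.16)
>>> c(observed=out$cntT, expected=out$cntR*out$cntI/10000)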
>>>
>>>
>>>
>>>Phillip Good
>>>Huntington Beach CA
>>>
>>>
>>>
--
Spencer Graves, PhD
Senior Development Engineer
PDF Solutions, Inc.
333 West San Carlos Street Suite 700
San Jose, CA 95110, USA
spencer.graves at pdf.com
www.pdf.com
Tel: 408-938-4420
Fax: 408-280-7915