[Rd] The two chisq.test p values differ when the contingency table (PR#3896)

Thu Aug 21 21:20:43 MEST 2003

>>>>> dmurdoch  writes:

>> Date: Wed, 16 Jul 2003 01:27:25 +0200 (MET DST)
>> From: shitao at ucla.edu

>>> x
>> [,1] [,2]
>> [1,]  149  151
>> [2,]    1    8
>>> c2x<-chisq.test(x, simulate.p.value=T, B=100000)$p.value
>>> for(i in (1:20)){c2x<-c(c2x,chisq.test(x,
>> simulate.p.value=T,B=100000)$p.value)}
>>> c2tx<-chisq.test(t(x), simulate.p.value=T, B=100000)$p.value
>>> for(i in (1:20)){c2tx<-c(c2tx,chisq.test(t(x), simulate.p.value=T,
>> + B=100000)$p.value)}
>>> cbind(c2x,c2tx)
>> c2x    c2tx
>> [1,] 0.03711 0.01683
>> [2,] 0.03717 0.01713

> The problem is in ctest/R/chisq.test.R, where the p-value is
> calculated as 

>             STATISTIC <- sum((x - E) ^ 2 / E)
>             PARAMETER <- NA
>             PVAL <- sum(tmp$results >= STATISTIC) / B

> Here tmp$results is a collection of simulated chisquare values, but
> because of different rounding, the statistics corresponding to tables
> equal to the observed table are slightly smaller than the value
> calculated in STATISTIC, and effectively the p-value is calcuated as

>              PVAL <- sum(tmp$results > STATISTIC) / B

> instead.

> What's the appropriate fix here? 

> PVAL <- sum(tmp$results > STATISTIC - .Machine$double.eps^0.5) / B

> works on this example, but is there something better?

Argh.  Very interesting ...

I think it works to use

            STATISTIC <- sum(sort((x - E) ^ 2 / E, decreasing = TRUE))

instead: this starts by summing the big values, and hence if at all
slightly 'underestimates' the real value, which is fine for the
comparisons.

Fix committed to r-devel.  Thanks for looking into this.

-k