[R] On Corrections for Chi-Sq Goodness of Fit Test
Rolf Turner
rolf.turner at xtra.co.nz
Fri Dec 23 04:56:07 CET 2011
On 20/12/11 10:24, Michael Fuller wrote:
> TOPIC
> My question regards the philosophy behind how R implements corrections to chi-square statistical tests. At least in recent versions (I'm using 2.13.1 (2011-07-08) on OSX 10.6.8.), the chisq.test function applies the Yates continuity correction for 2 by 2 contingency tables. But when used as a goodness of fit test (GoF, aka likelihood ratio test), chisq.test does not appear to implement any corrections for widely recognized problems, such as small sample size, non-uniform expected frequencies, and one D.F.
>
> From the help page:
> "In the goodness-of-fit case simulation is done by random sampling from the discrete distribution specified by p, each sample being of size n = sum(x)."
>
> Is the thinking that random sampling completely obviates the need for corrections?
Yes.
> Wouldn't the same statistical issues still apply
No.
> (e.g. poor continuity approximation with one D.F.,
There are no degrees of freedom involved. There is no continuity
involved.
The observed test statistic (say "Stat") is compared with a number of
test statistics, Stat_1, ..., Stat_N, calculated from data sets
simulated under the null hypothesis. If the null is true, then Stat and
Stat_1, ..., Stat_N are all of ``equal status''. If there are m values
of the Stat_i which are greater than Stat, then the ``probability of
observing, under the null hypothesis, data as extreme as, or more
extreme than, what you actually observed'' is the probability of
randomly selecting one of a specified set of m+1 ``slots'' out of a
total of N+1 slots (where each slot has probability 1/(N+1)). Thus the
p-value is (exactly) equal to (m+1)/(N+1).
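A minimal sketch of this calculation in R may make the (m+1)/(N+1) rule
concrete. The counts x, the null probabilities p, and the helper
pearson() are made up for illustration and do not come from the original
question; the sketch is not meant to reproduce the internals of
chisq.test().

```r
set.seed(1)                # reproducibility only
x <- c(12, 5, 3)           # hypothetical observed counts
p <- c(0.5, 0.3, 0.2)      # hypothetical null probabilities
n <- sum(x)                # each simulated sample has size n = sum(x)
N <- 1999                  # number of simulated data sets

# Pearson chi-square statistic for one vector of counts (illustrative helper)
pearson <- function(counts) sum((counts - n * p)^2 / (n * p))

Stat <- pearson(x)

# Each column of rmultinom() is one data set drawn under the null
sims   <- rmultinom(N, size = n, prob = p)
Stat_i <- apply(sims, 2, pearson)

# m = number of simulated statistics at least as large as the observed one;
# with no ties, ">=" and ">" give the same count
m <- sum(Stat_i >= Stat)
p.value <- (m + 1) / (N + 1)
p.value
```

For comparison, chisq.test(x, p = p, simulate.p.value = TRUE, B = N)
should give a p-value of similar size, differing only through the random
draws.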
The only restriction is that there be no ties amongst the values of Stat
and Stat_1, ..., Stat_N. Ties have fairly low probability, but not zero
probability, since there is a finite number of possible samples and
hence of statistic values. So this restriction is a mild worry. However,
a ``continuity correction'' would be of no help whatsoever.
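One way to see how much ties matter for a particular data set is to
count them and to compare the p-value under the strict (">") and
non-strict (">=") conventions. The sketch below again uses made-up
counts x and probabilities p, an illustrative helper pearson(), and an
arbitrary tolerance tol for floating-point comparison. How often ties
occur depends on n and p, so the numbers it prints are specific to these
assumed inputs.

```r
set.seed(2)                # reproducibility only
x <- c(12, 5, 3)           # hypothetical observed counts
p <- c(0.5, 0.3, 0.2)      # hypothetical null probabilities
n <- sum(x)
N <- 1999                  # number of simulated data sets
tol <- 1e-9                # tolerance for treating two statistics as equal

# Pearson chi-square statistic for one vector of counts (illustrative helper)
pearson <- function(counts) sum((counts - n * p)^2 / (n * p))

Stat   <- pearson(x)
Stat_i <- apply(rmultinom(N, size = n, prob = p), 2, pearson)

# Number of simulated statistics tied with the observed one
ties <- sum(abs(Stat_i - Stat) < tol)

# p-values with ties excluded from, and included in, the count m
p.strict    <- (sum(Stat_i >  Stat + tol) + 1) / (N + 1)
p.nonstrict <- (sum(Stat_i >= Stat - tol) + 1) / (N + 1)

c(ties = ties, p.strict = p.strict, p.nonstrict = p.nonstrict)
```

The two p-values differ by ties/(N+1), so when there are no ties they
coincide.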
> problems with non-uniform expected frequencies, etc) with random sampling?
Don't understand what you mean by this.
cheers,
Rolf Turner