(PR#1007) [Rd] ks.test doesn't compute correct empirical

ripley@stats.ox.ac.uk
Tue, 3 Jul 2001 07:18:26 +0200 (MET DST)

On Tue, 3 Jul 2001, A. G. McDowell wrote:

> In message <Pine.GSO.4.31.0107010731110.7616-100000@auk.stats>, Prof
> Brian D Ripley <ripley@stats.ox.ac.uk> writes
> >
> >You do realize that the Kolmogorov tests (and the Kolmogorov-Smirnov
> >extension) assume continuous distributions, so the distribution theory
> >is not valid in this case?
> >
> >S-PLUS does stop you doing this:
> >
> >> ks.gof(o, dist="binomial", size=100, prob=0.25)
> >Problem in not.cont1(ttest = d.test, nx = nx, alt.ex..: For testing
> >discrete distributions when sample size > 50, use the
> >       Chi-square test
> >
> Thank you for your prompt reply to my bug report. While I agree that
> the distribution theory for the Kolmogorov tests assumes a continuous
> distribution, I would like to request a modification to the
> existing routines. The purpose of this would be to provide a result
> that would represent a conservative test in the case when the underlying
> distribution is discrete.
> This would be in accord with p. 432 of the 3rd edition of "Practical
> Nonparametric Statistics", by Conover, and section 25.38 of "Kendall's
> Advanced Theory of Statistics, 6th Edition, Vol 2A", by Stuart, Ord,
> and Arnold, both of which refer to Noether (1963), "Note on the
> Kolmogorov statistic in the discrete case", Metrika, 7, 115. Users
> reared on these and similar textbooks would be less surprised at the
> behaviour of R if this modification was made, whereas users who do
> not attempt to apply the Kolmogorov-Smirnov test to discrete
> distributions would not notice any difference.

(Hopefully readers of those textbooks would understand that the results you
reported as a bug *are* the behaviour of the KS test.  Nowhere does R
claim to have implemented a modified KS test.  The one data point we have
suggests otherwise ....)

> It would also be in accord with the behaviour of R in the two-sample
> case, where the effect of the existing code seems to be to provide
> a conservative test (since the statistic returned is no larger than
> might be returned under any possible tie-breaking) coupled with a
> warning (to which I would have no objection in the one-sample case).
> It seems to me that the following modification would suffice: replace
>         x <- y(sort(x), ...) - (0 : (n-1)) / n
> with
>         x <- sort(x)
>         untied <- c(x[1:(n-1)] != x[2:n], TRUE)
>         x <- y(x, ...) - (0 : (n-1)) / n
>         x <- x[untied]
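[The arithmetic of this proposal can be checked with a sketch outside R (editor's illustration, not part of the original exchange): Python/numpy below, with a hand-rolled `binom_cdf` standing in for the `y(...)` CDF callback of `ks.test`, and an arbitrary seed. Filtering the deviation vector on `untied` takes a maximum over a subset, so the resulting statistic can never be larger than the classical one, i.e. the modification is conservative:]

```python
import math
import numpy as np

def binom_cdf(k, size, prob):
    # direct-summation binomial CDF, standing in for ks.test's y(...) callback
    pmf = [math.comb(size, i) * prob**i * (1 - prob)**(size - i)
           for i in range(size + 1)]
    return np.cumsum(pmf)[np.asarray(k, dtype=int)]

rng = np.random.default_rng(0)
n = 10000
x = np.sort(rng.binomial(100, 0.25, size=n))

dev = binom_cdf(x, 100, 0.25) - np.arange(n) / n   # ks.test's internal vector
d_classical = max(dev.max(), (1 / n - dev).max())  # two-sided statistic D

untied = np.append(x[:-1] != x[1:], True)          # last member of each tied run
dev_mod = dev[untied]                              # proposed filtering step
d_modified = max(dev_mod.max(), (1 / n - dev_mod).max())

assert d_modified <= d_classical                   # never larger: conservative
```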

In your original examples, this reduces a sample of size 10000 to one of
size 101 or 2.  Conservative - yes.   Useful - very unlikely!
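[That reduction is easy to verify in a quick Python/numpy sketch (editor's illustration; the seed is arbitrary): a binomial(100, 0.25) sample takes at most 101 distinct values, so at most 101 points survive the proposed `untied` filter, however large the sample:]

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.sort(rng.binomial(100, 0.25, size=10000))
untied = np.append(x[:-1] != x[1:], True)  # True only at the last of each tied run
# surviving points = number of distinct sample values, at most 101 here
print(int(untied.sum()))
```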

> Users dealing with data derived from continuous distributions would
> not see any difference, because (except with very small probability
> due to floating point inaccuracy) they would never produce tied data.

There are circumstances in which one would want the original KS definition
for all data sets, where one wants the test statistic and not the p value.

I've added a warning, but I do not think we should be implementing a
different definition.

Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch