(PR#1007) [Rd] ks.test doesn't compute correct empirical distribution if there are ties in the data

mcdowella@mcdowella.demon.co.uk mcdowella@mcdowella.demon.co.uk
Tue, 3 Jul 2001 06:45:52 +0200 (MET DST)

In message <Pine.GSO.4.31.0107010731110.7616-100000@auk.stats>, Prof
Brian D Ripley <ripley@stats.ox.ac.uk> writes
>You do realize that the Kolmogorov tests (and the Kolmogorov-Smirnov
>extension) assume continuous distributions, so the distribution theory
>is not valid in this case?
>S-PLUS does stop you doing this:
>> ks.gof(o, dist="binomial", size=100, prob=0.25)
>Problem in not.cont1(ttest = d.test, nx = nx, alt.ex..: For testing
>discrete distributions when sample size > 50, use the
>       Chi-square test

Thank you for your prompt reply to my bug report. While I agree that
the distribution theory for the Kolmogorov tests assumes a continuous
distribution, I would like to request a modification to the existing
routines so that they provide a conservative test when the underlying
distribution is discrete.

This would be in accord with p. 432 of the 3rd edition of "Practical
Nonparametric Statistics", by Conover, and section 25.38 of "Kendall's
Advanced Theory of Statistics, 6th Edition, Vol 2A", by Stuart, Ord,
and Arnold, both of which refer to Noether (1963) "Note on the
Kolmogorov Statistic in the discrete case", Metrika, 7, 115. Users
reared on these and similar textbooks would be less surprised at the
behaviour of R if this modification were made, whereas users who do
not attempt to apply the Kolmogorov-Smirnov test to discrete
distributions would not notice any difference.

It would also be in accord with the behaviour of R in the two-sample
case, where the effect of the existing code seems to be to provide
a conservative test (since the statistic returned is no larger than
might be returned under any possible tie-breaking), coupled with a
warning (to which I would have no objection in the one-sample case).

It seems to me that the following modification would suffice: replace

        x <- y(sort(x), ...) - (0 : (n-1)) / n

with

        x <- sort(x)
        untied <- c(x[-n] != x[-1], TRUE)
        x <- y(x, ...) - (0 : (n-1)) / n
        x <- x[untied]
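
As a rough illustration of the effect, the replacement can be wrapped in a
small function and run on a sample with ties. This is only a sketch: the
name ks_diffs and its drop_ties argument are hypothetical, not part of R,
and the two-sided statistic is assumed here to be formed as
max(c(d, 1/n - d)) from the differences d.

```r
# Hypothetical helper (not part of R): returns the differences
# F(x_(i)) - (i-1)/n, optionally keeping only the last member of each
# run of tied values, as in the proposed modification.
ks_diffs <- function(x, y, ..., drop_ties = TRUE) {
  n <- length(x)
  x <- sort(x)
  # TRUE where an observation differs from its successor;
  # the final observation is always kept.
  untied <- c(x[-n] != x[-1], TRUE)
  d <- y(x, ...) - (0:(n - 1)) / n
  if (drop_ties) d[untied] else d
}

# Small discrete sample with ties, compared against Binomial(4, 0.5).
o <- c(1, 2, 2, 2, 3)
n <- length(o)
d_old <- ks_diffs(o, pbinom, size = 4, prob = 0.5, drop_ties = FALSE)
d_new <- ks_diffs(o, pbinom, size = 4, prob = 0.5)

# Assumed form of the two-sided statistic (see the note above).
stat_old <- max(c(d_old, 1 / n - d_old))
stat_new <- max(c(d_new, 1 / n - d_new))

print(c(existing = stat_old, modified = stat_new))
```

In this example only one difference per distinct value is retained, so the
statistic from the modified code cannot exceed the one from the existing
code.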

Users dealing with data derived from continuous distributions would
not see any difference, because (except with very small probability 
due to floating-point inaccuracy) they would never produce tied data.

A. G. McDowell

r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html