[R] KS test from ctest package

Fri Apr 9 03:49:03 CEST 1999

This question is mainly aimed at Kurt Hornik as author of the ctest package,
but I'm cc'ing it to r-help as I suspect there will be other valuable
opinions out there.

I have been attempting two-sample Kolmogorov-Smirnov tests using the ks.test
function from the ctest package (ctest v.0.9-15, R v.0.63.3 win32).  I am
comparing fish length-frequency distributions.  My main reference for the KS
test at present is Sokal & Rohlf, Biometry (2nd edn), pages 440-445.

The individuals in my samples are measured to the nearest 0.5 cm, so most
samples contain several identical length values.  It appears that the
KS test statistic D is being overestimated (and the p-value therefore
underestimated).  I think this is best illustrated by a trivial (but
extreme) example:

	> library(ctest)
	> x <- y <- rep(1,10)
	> ks.test(x,y)

	         Two-sample Kolmogorov-Smirnov test 

	data:  x and y 
	D = 1, p-value = 9.08e-005 
	alternative hypothesis: two.sided 

Obviously when two identical vectors are compared the test statistic D should
be zero and the p-value should be 1, i.e. there should be no evidence against
the two vectors representing the same underlying distribution.
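
As a quick check of that expectation, here is a minimal sketch that computes
D directly as the largest absolute difference between the two empirical
cumulative distribution functions.  It assumes a current R installation in
which ecdf() is available; the helper name ks_d_ecdf is hypothetical and is
not part of ctest.

```r
# Hypothetical helper: D as the maximum absolute difference between the
# two empirical CDFs, evaluated at every distinct pooled value.
ks_d_ecdf <- function(x, y) {
  Fx <- ecdf(x)                      # step function for sample x
  Fy <- ecdf(y)                      # step function for sample y
  pts <- sort(unique(c(x, y)))       # both ECDFs only jump at these points
  max(abs(Fx(pts) - Fy(pts)))
}

x <- y <- rep(1, 10)                 # the extreme example: identical, all tied
d_identical <- ks_d_ecdf(x, y)
print(d_identical)                   # 0

# Two samples with no overlap at all give the other extreme.
d_disjoint <- ks_d_ecdf(c(1, 1, 2), c(3, 4, 4))
print(d_disjoint)                    # 1
```

Because both step functions change only at observed values, evaluating them
at the distinct pooled values is enough to find the maximum difference.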

If D is calculated using the first method outlined by Sokal & Rohlf (maximum
absolute difference between relative cumulative frequencies) then D is
indeed 0.  The method used in the ctest code is presented by Sokal & Rohlf
as an alternative (NB not approximate) computation scheme and attributed to
Gideon & Mueller (1978).  The pertinent code is the line:

        z <- ifelse(order(c(x, y)) <= n.x, 1/n.x, -1/n.y)

If the two vectors in the example above had been identical, but with no
repeated values, the result of order(c(x, y)) would have been along the
lines of

 [1]  1 11  2 12  3 13  4 14  5 15  6 16  7 17  8 18  9 19 10 20

(the essential point being that items in the result come alternately
from x & y).  D is calculated as max(abs(cumsum(z))), with the result that
the minimum D for identical vectors is min(1/n.x, 1/n.y).  (It therefore
appears to me that this computational method should be considered an
approximate rather than alternative method.)
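
To make the 1/n floor concrete, here is a minimal sketch that wraps the line
quoted above in a small function and applies it to identical vectors with no
repeated values.  The function name ks_d_order is hypothetical (it is not
part of ctest); only the ifelse(order(...)) line and the max(abs(cumsum(z)))
step come from the code discussed here.

```r
# Hypothetical helper: the order-based scheme built from the quoted line.
ks_d_order <- function(x, y) {
  n.x <- length(x)
  n.y <- length(y)
  # +1/n.x for each element that came from x, -1/n.y for each from y,
  # taken in the order of the pooled, sorted sample
  z <- ifelse(order(c(x, y)) <= n.x, 1/n.x, -1/n.y)
  max(abs(cumsum(z)))
}

x <- y <- 1:10            # identical, no repeated values within a vector
d_untied <- ks_d_order(x, y)
print(d_untied)           # 0.1, i.e. 1/n.x, although the samples are identical
```

The running sum alternates between 1/n.x and 0, so its largest absolute
value can never fall below 1/n.x here.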

In the case of vectors with replicated values the problem is worse because
values from one vector are grouped together in the vector returned by
order.  In the case of the example above:

> order(c(x, y))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
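
One possible adjustment, offered only as a sketch and not as what ctest
does, is to keep the same order-based running sum but read it off only at
the last position of each run of tied values, so that all observations
sharing a value are counted before the difference is inspected.  The name
ks_d_ties is hypothetical, and the sketch assumes numeric vectors with no
missing values.

```r
# Hypothetical tie-aware variant of the order-based computation.
ks_d_ties <- function(x, y) {
  n.x <- length(x)
  n.y <- length(y)
  w <- c(x, y)
  # running difference between the two cumulative relative frequencies
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  # TRUE at the last position of each run of equal values in the sorted pool
  last_of_run <- c(diff(sort(w)) != 0, TRUE)
  max(abs(z[last_of_run]))
}

x <- y <- rep(1, 10)      # the extreme example: every value tied
d_tied <- ks_d_ties(x, y)
print(d_tied)             # zero up to floating-point rounding
```

With no ties every position is the end of its own run, so the result is
then the same as max(abs(cumsum(z))).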

I don't think this can be considered a bug, but it is certainly a problem
for the method used in computing D.  Has anyone coded alternative KS test
computation methods in R/S?  It's obviously not hard, but could be slow unless
done elegantly!

Thanks

David Middleton
