[Rd] ks.test - The two-sample two-sided Kolmogorov-Smirnov test with ties (PR#13848)
tpw at google.com
tpw at google.com
Wed Jul 22 22:45:12 CEST 2009
Full_Name: Thomas Waterhouse
Version: 2.9.1
OS: OS X 10.5.7
Submission from: (NULL) (216.239.45.4)
ks.test uses a biased approximation to the p-value in the case of the two-sample
test with ties in the pooled data. This has been justified in R in the past by
the argument that the KS test assumes a continuous distribution. But the
two-sample test can be extended to arbitrary distributions by a combinatorial
argument (below), which I would like to see implemented in R.
The p-value of the test is simply the probability that a random assignment of
the pooled data to two sets of the sizes of the input data sets has a KS test
statistic at least as great as that of the input data. The function psmirnov2x
in ks.c calculates this probability by enumerating all such assignments as
lattice walks, except that it doesn't know how to handle tied data points.
The correct procedure can be deduced by considering that steps in such lattice
walks represent steps along the x-axis in computing the empirical CDFs of the
two data sets. The trick is to consider all occurrences of a repeated data
point together, when determining whether a path exceeds the input statistic or
not. A FORTRAN implementation of this algorithm was published by Nikiforov
(http://www.ams.sunysb.edu/~nkfr/5PUBL.HTM), but it's not for the faint of
heart.
It would be straightforward to change psmirnov2x to include this logic, but I
found it easier simply to rewrite it (with comments this time!) as follows:
void
psmirnov2x(double *x, Sint *m, Sint *n, double *pool)
{
/* Inputs: *x is the value of the KS test statistic (from 0.0 to 1.0)
*m and *n are the lengths of each data set
*pool is an array of length *m + *n formed by combining the two
data sets and sorting in increasing order
Output: *x is unity minus the p-value of the test
Notes: The p-value of the two-sample KS test is the proportion of all
assignments of the pooled data to sets of sizes *m and *n which
have test statistic exceeding the test statistic of the actual
data.
To calculate this, we interpret each assignment as a lattice walk
from the origin to (*m, *n), where the i-th step represents the
i-th
data point in the pooled data set. A step to the right represents
assigning a data point to the *m set, and a step up represents
assigning a data point to the *n set. The difference between the
empirical CDFs of each set can be computed at each step, and the
KS test statistic is simply the maximum of this value along a given
walk.
The p-value is obtained by counting the number of walks that do
not exceed the input value (since this is easier than counting the
number that do) and dividing by the total number of walks; this is
what we return to the caller.
If the same value occurs multiple times in the pooled data, we
need to count all multiplicites at the same time when computing the
empirical CDFs. In terms of the lattice walk, this means we look
only the final occurence of each value when determining whether or
not a walk exceeds the input statistic.
*/
double md, nd, q, *u, w;
Sint i, j;
if (*x == 0.0) {
return 1.0;
}
md = (double) (*m);
nd = (double) (*n);
q = floor(*x * md * nd - 1e-7) / (md * nd); /* just *x minus an epsilon */
u = (double *) R_alloc(*n + 1, sizeof(double));
for (i = 0; i <= *m; i++) {
for (j = 0; j <= *n; j++) {
/* At each step, rescale by w := j/(j+*m); this is equivalent to
dividing by C(*m+*n, *m) at the end, but without the risk of
overflow */
w = (double)(j) / (double)(j + *m);
if (i + j == 0) {
/* Start with 1 walk from the origin to the origin */
u[j] = 1.0;
} else if (i + j < *m + *n &&
pool[i + j - 1] != pool[i + j] &&
fabs(i / md - j / nd) > q) {
/* This walk meets or exceeds *x, so don't count it */
u[j] = 0.0;
} else if (i == 0) {
/* The *n-edge: The number of walks to (0, j) is the same as
the number of walks to (0, j-1) */
u[j] = u[j - 1] * w;
} else if (j == 0) {
/* The *m-edge: The number of walks to (i, 0) is the same as
the number of walks to (i-1, 0) */
/* no-op */
} else {
/* The general case: The number of walks to (i, j) is the
number of walks to (i-1, j) plus the number of walks to
(i, j-1) */
u[j] = u[j] + u[j - 1] * w;
}
}
}
/* Return the proportion of walks that do not exceed the input statistic:
*/
*x = u[*n];
}
ks.test.R must also be changed as per the following diff, which includes a
corrected calculation of the test statistic:
51d50
< TIES <- FALSE
54,59c53,57
< z <- cumsum(ifelse(order(w) <= n.x, 1 / n.x, - 1 / n.y))
< if(length(unique(w)) < (n.x + n.y)) {
< warning("cannot compute correct p-values with ties")
< z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)]
< TIES <- TRUE
< }
---
> steps = which(diff(sort(w)) != 0)
> if(length(steps) > 0)
> z <- cumsum(ifelse(order(w) <= n.x, 1 / n.x, - 1 / n.y))[steps]
> else
> z <- c(0)
68c66
< if(exact && (alternative == "two.sided") && !TIES)
---
> if(exact && (alternative == "two.sided"))
72a71
> sort(w),
Finally, the changed signature of psmirnov2x needs to be reflected in ctest.h
and init.c.
I have built and tested the above implementation against 20 pairs of data sets
for which I calculated the test statistics and p-values by brute force or
Nikiforov's algorithm. I can provide these test sets if desired.
More information about the R-devel
mailing list