[R] Question about Kolmogorov-Smirnov test behavior

peter dalgaard pdalgd at gmail.com
Thu Jan 7 15:29:13 CET 2016

On 07 Jan 2016, at 14:09 , Shea Lutton <shea at eagleseven.com> wrote:

> Dear R-Help,
>       I am trying to understand the output of the KS test on a pair of files. I am trying to determine if the CDF of one distribution is less than (to the left of) the CDF of a second distribution. My problem is that regardless of whether I run A against B, or B against A, the KS output seems to indicate significance that A is less than B AND B is less than A. Can anybody help me understand where my mistake is or if I am misinterpreting the results? 
> Here is my code:
> file_a = readLines("./file_a.txt")
> file_b = readLines("./file_b.txt")
> a <- as.numeric(file_a)
> b <- as.numeric(file_b)
> ks.test(b, a, alternative = "less")
> ks.test(a, b, alternative = "less")
> And here is the output:
> 	Two-sample Kolmogorov-Smirnov test
> data:  b and a
> D^- = 0.087769, p-value < 2.2e-16
> alternative hypothesis: the CDF of x lies below that of y
> 	Two-sample Kolmogorov-Smirnov test
> data:  a and b
> D^- = 0.085083, p-value < 2.2e-16
> alternative hypothesis: the CDF of x lies below that of y
>> plot(ecdf(a), col = "blue")
>> plot(ecdf(b), add = TRUE, col = "red", lty = 1, pch = 26)
>> plot(density(a))
>> lines(density(b), col = "red")
> My data files can be found here, they are simple columns of numbers. 
>     file_a.txt : http://pastebin.com/e3bmnEDt
>     file_b.txt : http://pastebin.com/5VBzHRXZ

This effect can be generated quite easily by simulation:

> a <- rnorm(1000) ; b <-rnorm(1000, sd=10)
> ks.test(a, b, alternative="less")

	Two-sample Kolmogorov-Smirnov test

data:  a and b
D^- = 0.394, p-value < 2.2e-16
alternative hypothesis: the CDF of x lies below that of y

> ks.test(b, a, alternative="less")

	Two-sample Kolmogorov-Smirnov test

data:  b and a
D^- = 0.412, p-value < 2.2e-16
alternative hypothesis: the CDF of x lies below that of y

The cause should be quite apparent if you do

 plot(ecdf(b))
 plot(ecdf(a), add=TRUE, col="red")

 plot(function(x) ecdf(a)(x) - ecdf(b)(x), from=-10, to=10)

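The basic point is that, since the KS statistic is a maximum difference, two CDFs may deviate in both the positive and the negative direction at the same time. Each one-sided test only picks up the largest deviation in its own direction, so both can come out significant.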
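To see the two deviations as numbers rather than from a plot, one can compute both one-sided differences directly from the empirical CDFs. This is only a sketch on the simulated data above; the seed and the names Fa, Fb, grid, d_plus and d_minus are arbitrary choices for illustration, not anything ks.test exposes.

```r
set.seed(1)                       # arbitrary seed, only so the sketch is reproducible
a <- rnorm(1000)
b <- rnorm(1000, sd = 10)

Fa <- ecdf(a)
Fb <- ecdf(b)
grid <- sort(c(a, b))             # the ECDFs only jump at observed values

d <- Fa(grid) - Fb(grid)
d_plus  <- max(d)                 # largest amount by which ecdf(a) lies above ecdf(b)
d_minus <- max(-d)                # largest amount by which ecdf(a) lies below ecdf(b)
c(d_plus = d_plus, d_minus = d_minus)

# With alternative = "less", ks.test(x, y) reports the largest amount by
# which the ECDF of x lies below that of y, so swapping the arguments
# swaps which of the two deviations is tested.
all.equal(d_minus, unname(ks.test(a, b, alternative = "less")$statistic))
all.equal(d_plus,  unname(ks.test(b, a, alternative = "less")$statistic))
```

Both d_plus and d_minus are far from zero here because the two distributions differ in spread rather than in location, so their CDFs cross.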

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
