[R] Question about Kolmogorov-Smirnov test behavior

peter dalgaard pdalgd at gmail.com
Thu Jan 7 15:29:13 CET 2016


On 07 Jan 2016, at 14:09 , Shea Lutton <shea at eagleseven.com> wrote:

> Dear R-Help,
>       I am trying to understand the output of the KS test on a pair of files. I am trying to determine if the CDF of one distribution is less than (to the left of) the CDF of a second distribution. My problem is that regardless of whether I run A against B, or B against A, the KS output seems to indicate significance that A is less than B AND B is less than A. Can anybody help me understand where my mistake is or if I am misinterpreting the results? 
> 
> 
> Here is my code:
> 
> file_a = readLines("./file_a.txt")
> file_b = readLines("./file_b.txt")
> a <- as.numeric(file_a)
> b <- as.numeric(file_b)
> ks.test(b, a, alternative = "less")
> ks.test(a, b, alternative = "less")
> 
> 
> And here is the output:
> 
> 	Two-sample Kolmogorov-Smirnov test
> 
> data:  b and a
> D^- = 0.087769, p-value < 2.2e-16
> alternative hypothesis: the CDF of x lies below that of y
> 
> 	Two-sample Kolmogorov-Smirnov test
> 
> data:  a and b
> D^- = 0.085083, p-value < 2.2e-16
> alternative hypothesis: the CDF of x lies below that of y
> 
>> plot(ecdf(a), col = "blue")
>> plot(ecdf(b), add = TRUE, col = "red", lty = 1, pch = 26)
>> plot(density(a))
>> lines(density(b), col = "red")
> 
> 
> My data files can be found here, they are simple columns of numbers. 
>     file_a.txt : http://pastebin.com/e3bmnEDt
>     file_b.txt : http://pastebin.com/5VBzHRXZ
> 


This effect can be generated quite easily by simulation:

> a <- rnorm(1000) ; b <- rnorm(1000, sd=10)
> ks.test(a, b, alternative="less")

	Two-sample Kolmogorov-Smirnov test

data:  a and b
D^- = 0.394, p-value < 2.2e-16
alternative hypothesis: the CDF of x lies below that of y

> ks.test(b, a, alternative="less")

	Two-sample Kolmogorov-Smirnov test

data:  b and a
D^- = 0.412, p-value < 2.2e-16
alternative hypothesis: the CDF of x lies below that of y


The cause should be quite apparent if you do

plot(ecdf(b))
plot(ecdf(a), add=TRUE)

and 

plot(function(x)ecdf(a)(x)-ecdf(b)(x), from=-10, to=10)

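The basic point is that each one-sided KS statistic is the maximum difference between the two empirical CDFs in one direction only. When the CDFs cross, as in the simulation above where the two samples differ in spread rather than location, they can deviate substantially in both the positive and the negative direction at the same time, so both one-sided tests come out significant.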
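
To make this concrete, here is a minimal sketch that computes both one-sided statistics by hand from the empirical CDFs and compares them with what ks.test() reports. The seed and the names Fa, Fb, d_plus and d_minus are my own choices for illustration; the data are simulated as above, not the poster's files.

```r
# Sketch only: simulated data with an arbitrary seed (an assumption, not from the original run)
set.seed(1)
a <- rnorm(1000)
b <- rnorm(1000, sd = 10)

Fa <- ecdf(a)
Fb <- ecdf(b)

# The difference of two step functions takes its extreme values at
# observed data points, so evaluating at the pooled sample is enough.
x <- sort(c(a, b))
d <- Fa(x) - Fb(x)

d_plus  <- max(d)    # largest amount by which ecdf(a) lies above ecdf(b)
d_minus <- max(-d)   # largest amount by which ecdf(a) lies below ecdf(b)

c(d_plus = d_plus, d_minus = d_minus)

# These should agree with the statistics from the two one-sided tests
ks.test(a, b, alternative = "greater")$statistic
ks.test(a, b, alternative = "less")$statistic
```

If both d_plus and d_minus are large, as they should be here, neither CDF lies consistently on one side of the other, and a one-sided KS test in either direction will reject.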

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


