[R] ks.test() output interpretation

Christoph Buser buser at stat.math.ethz.ch
Tue Jun 28 17:32:28 CEST 2005


Hi

I would recommend graphical methods to compare two samples from
possibly different distributions; see ?qqplot.
Since the Kolmogorov-Smirnov test often has very low power, you
cannot conclude that two samples come from the same distribution
merely because ks.test() is not significant.
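
For example, a minimal sketch of such a graphical comparison
(with made-up data and placeholder names, just to illustrate):

## Minimal sketch with made-up data: compare two samples graphically
set.seed(1)
a <- rnorm(100)             # e.g. measurements from method 1
b <- rnorm(100, sd = 1.5)   # e.g. measurements from method 2
qqplot(a, b, xlab = "method 1", ylab = "method 2")
abline(0, 1)                # equal distributions give points near this line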

The following example illustrates one problem:
In a short simulation we generate two samples 1000 times (with
100 observations per sample). The first sample has a standard
normal distribution, the second a t-distribution with 1 degree
of freedom. For each of these 1000 pairs we run ks.test() and
save the p-value.

x1 <- matrix(nrow = 100, ncol = 1000)   # standard normal samples
y1 <- matrix(nrow = 100, ncol = 1000)   # t(1) samples
test1 <- numeric(1000)                  # p-values of the KS tests
for(i in 1:1000) {
  set.seed(i)
  x1[,i] <- rnorm(100)
  y1[,i] <- rt(100, df = 1)
  test1[i] <- ks.test(x1[,i], y1[,i])$p.value
}
sum(test1 < 0.05)                       # number of significant tests at the 5% level


Only in 309 of the 1000 cases does the test show a significant
difference between the two samples. In all other cases we would
conclude that the two samples have the same distribution.
This is an example with 100 observations per group. If you have
smaller groups, the power is even worse.
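
As a rough illustration of this point (a sketch only; the exact
count will vary, but it is typically much smaller than 309), the
same simulation can be repeated with 20 observations per sample:

## Sketch: the same simulation with only 20 observations per sample
test2 <- numeric(1000)
for(i in 1:1000) {
  set.seed(i)
  test2[i] <- ks.test(rnorm(20), rt(20, df = 1))$p.value
}
sum(test2 < 0.05)                       # typically far fewer significant tests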

If we look at 9 randomly drawn pairs from the 1000 simulations
and draw a Q-Q plot for each:

par(mfrow = c(3,3))
ind <- sample(1:1000, 9)
tmp <- sapply(ind, function(j) qqplot(x1[,j],y1[,j], xlab = paste("x1[,",j,"]"),
                                      ylab = paste("y1[,",j,"]")))

In many cases we can see that the two distributions are
different. Compare this to the Q-Q plots of two normally
distributed random variables:

x2 <- matrix(rnorm(900), nrow = 100, ncol = 9)
y2 <- matrix(rnorm(900), nrow = 100, ncol = 9)
par(mfrow = c(3,3))
tmp <- sapply(1:9, function(j) qqplot(x2[,j],y2[,j], xlab = paste("x2[,",j,"]"),
                                      ylab = paste("y2[,",j,"]")))
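
Since the two samples in each pair have the same size, a y = x
reference line can be added to make the comparison easier; for
example, a small variation on the panels drawn above:

## Redraw the 9 panels from above with a y = x reference line
par(mfrow = c(3,3))
tmp <- sapply(ind, function(j) {
  qqplot(x1[,j], y1[,j], xlab = paste("x1[,",j,"]"),
         ylab = paste("y1[,",j,"]"))
  abline(0, 1)
})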

Of course there are situations in which graphical methods fail,
too, but they make it clear that this is a descriptive way of
comparing two distributions.
The Kolmogorov-Smirnov test, in contrast, pretends to give a
clear-cut result (that the two distributions are the same),
which is wrong or at least misleading.

Best regards,

Christoph Buser

--------------------------------------------------------------
Christoph Buser <buser at stat.math.ethz.ch>
Seminar fuer Statistik, LEO C13
ETH (Federal Inst. Technology)	8092 Zurich	 SWITZERLAND
phone: x-41-44-632-4673		fax: 632-1228
http://stat.ethz.ch/~buser/
--------------------------------------------------------------


kapo coulibaly writes:
 > I'm using ks.test() to compare two different
 > measurement methods. I don't really know how to
 > interpret the output in the absence of critical value
 > table of the D statistic. I guess I could use the
 > p-value when available. But I also get the message
 > "cannot compute correct p-values with ties ..." does
 > it mean I can't use ks.test() for these data or I can
 > still use the D statistic computed to make a decision
 > whether the two samples come from the same
 > distribution.
 > 
 > Thanks!!
 > 
 > 
 > ______________________________________________
 > R-help at stat.math.ethz.ch mailing list
 > https://stat.ethz.ch/mailman/listinfo/r-help
 > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html



