[R] Kolmogorov-Smirnov test

Greg Snow Greg.Snow at imail.org
Fri Apr 29 23:34:55 CEST 2011

The general idea of the KS test (and others) can be applied to discrete data, but the implementation in R assumes continuous data: it does not have the adjustments needed to deal with ties.  The chi-square and other tests suffer from the same problems in your case.  In all cases the null hypothesis is that the data come from the stated distribution (Poisson in your case).  Failing to reject the null hypothesis does not prove that the data come from that distribution; it only shows that we cannot disprove it.  With large sample sizes the opposite problem appears: your data could come from a true distribution that is, for all practical purposes, equivalent to the Poisson but, due to slight rounding or other errors, has slightly different probabilities for some values (a difference that no one would reasonably care about), and these tests can still show a significant difference.
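
As a rough illustration of the large-sample point, the sketch below simulates counts from a distribution that differs from a Poisson by half a percentage point in two cells and runs a chi-square test against the exact Poisson probabilities.  Everything here is made up for illustration (lambda = 3, the size of the perturbation, the pooling of values 8 and above into one cell, and the variable names); it uses only base R.

```r
set.seed(1)

lambda <- 3
# Poisson cell probabilities for 0..7, with 8 and above pooled into one cell
p_pois <- c(dpois(0:7, lambda), ppois(7, lambda, lower.tail = FALSE))

# A "true" distribution that is almost Poisson: move 0.005 of probability
# from the cell for 3 to the cell for 2 (hypothetical rounding-type error)
p_true <- p_pois
p_true[3] <- p_true[3] + 0.005
p_true[4] <- p_true[4] - 0.005

# One very large sample of cell counts drawn from the near-Poisson distribution
n <- 1e6
counts <- as.vector(rmultinom(1, size = n, prob = p_true))

# Chi-square goodness-of-fit test against the exact Poisson probabilities
big <- chisq.test(counts, p = p_pois)
print(big)

# The largest difference in any cell probability is still only 0.005
print(max(abs(p_true - p_pois)))
```

With a million observations the test should reject decisively even though no cell probability is off by more than 0.005, which is the sense in which a significant result need not mean a difference anyone would care about.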

Usually it is better to just show that your data and the theoretical distribution are close enough to each other rather than depending on a formal test.  The plots and diagnostics in the vcd package are a good choice here.  You could also use the KS test statistic (ignoring the p-value and warnings) as another measure, but plot the empirical and theoretical distributions to see what the value means and how close they are.
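
A minimal base-R sketch of that idea follows.  It computes the largest vertical gap between the empirical and fitted Poisson CDFs directly on the integer support rather than calling ks.test, and then draws both step functions.  The data are simulated stand-ins (x is hypothetical; substitute your own counts), and lambda is estimated by the sample mean.

```r
set.seed(42)

x <- rpois(200, lambda = 4)   # stand-in data; replace with your own counts
lambda_hat <- mean(x)         # maximum-likelihood estimate for a Poisson

# Both CDFs are step functions that jump only at integers, so it is enough
# to compare them at 0, 1, ..., max(x)
k <- 0:max(x)
F_emp  <- ecdf(x)(k)
F_theo <- ppois(k, lambda_hat)

# KS-type distance: the largest vertical gap between the two CDFs
D <- max(abs(F_emp - F_theo))
print(D)

# Plot both CDFs to see where the gap is and how much it matters
plot(k, F_theo, type = "s", ylim = c(0, 1),
     xlab = "count", ylab = "cumulative probability",
     main = sprintf("Empirical vs fitted Poisson CDF (D = %.3f)", D))
lines(k, F_emp, type = "s", lty = 2)
legend("bottomright", legend = c("fitted Poisson", "empirical"), lty = c(1, 2))
```

Because of the ties, the statistic that ks.test reports for the same data need not equal this D; the number is meant as a descriptive measure of closeness to read alongside the plot, not as the input to a p-value.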

Another option is the vis.test function in the TeachingDemos package.  It lets you plot data simulated from the theoretical distribution alongside the actual data, then see whether you can visually tell the difference.
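
The same idea can be sketched in base R without the package.  This is not vis.test itself, only a hand-rolled "lineup" under assumed settings: nine panels, one showing the actual data and eight showing samples simulated from the fitted Poisson, with the position of the real panel chosen at random.  The data and all names are hypothetical.

```r
set.seed(7)

x <- rpois(150, lambda = 4)   # stand-in for the real data; replace with your own
lambda_hat <- mean(x)         # fitted Poisson mean

n_panels <- 9
real_pos <- sample(n_panels, 1)   # where the real data will be hidden

# One panel holds the real data, the rest hold simulated Poisson samples
panels <- lapply(seq_len(n_panels), function(i) {
  if (i == real_pos) x else rpois(length(x), lambda_hat)
})

# Draw all panels on a common set of count values so they are comparable
top <- max(unlist(panels))
old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (i in seq_len(n_panels)) {
  barplot(table(factor(panels[[i]], levels = 0:top)), main = paste("Panel", i))
}
par(old_par)

# Look at the plots first, pick the panel that stands out, then check:
# print(real_pos)
```

If you cannot reliably pick out the real data, the fitted distribution is probably close enough for practical purposes.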

Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of m.marcinmichal
> Sent: Thursday, April 28, 2011 3:54 PM
> To: r-help at r-project.org
> Subject: Re: [R] Kolmogorov-Smirnov test
> Hi,
> thanks for the response.
> >> The Kolmogorov-Smirnov test is designed for distributions on continuous
> >> variables, not discrete ones like the Poisson.  That is why you are
> >> getting some of your warnings.
> I read in "Fitting distributions with R" by Vito Ricci, page 19, that:
> "... Kolmogorov-Smirnov test is used to decide if a sample comes from a
> population with a specific distribution. I can be applied both for
> discrete (count) data and continuous binned (even if some Authors do not
> agree on this point) and both for continuous variables", but on page 16
> I read that "... while the Kolmogorov-Smirnov and Anderson-Darling tests
> are restricted to continuous distribution". I was a little confused, but
> I tried this test on my discrete data.
> Generally, as a first step, I try to fit my data to a discrete or
> continuous distribution (task: find a distribution for the empirical
> data). Question: can I approximate my discrete data by a continuous
> distribution? I know that sometimes the Poisson distribution can be
> approximated by the normal distribution. But what happens if I use
> another distribution, like the log-normal or gamma?
> I ran three other tests, all chi-square tests, but they return three
> different results. Suppose that we have the same data, i.e.
> vectorSentence.
> Tests:
> 1. One
> library(MASS)  # provides fitdistr()
> param <- fitdistr(vectorSentence, "poisson")
> chisq.test(table(vectorSentence), p = dpois(1:9, lambda = param[[1]][1]),
>            rescale.p = TRUE)
> X-squared = 272.8958, df = 8, p-value < 2.2e-16
> 2. Two
> library(vcd)  # provides goodfit()
> gf <- goodfit(vectorSentence, type = "poisson", method = "MinChisq")
> summary(gf)
>              X^2 df     P(> X^2)
> Pearson 404.3607  8 2.186332e-82
> 3. Three
> library(fitdistrplus)  # provides fitdist() and gofstat()
> fdistc <- fitdist(vectorSentence, "pois")
> g <- gofstat(fdistc, print.test = TRUE)
> Chi-squared statistic:  535.344
> Degree of freedom of the Chi-squared distribution:  8
> Chi-squared p-value:  1.824112e-110
> Question: which result is correct?
> I know that I can reject the null hypothesis: the data do not come from
> a Poisson distribution. But which result is correct?
> On the other hand, I am trying to solve another problem:
> 1. Suppose that we have reference data (dr) from some process (pr),
> saved in vectorSentence.
> 2. Suppose that we have two other data samples d1, d2 from two other
> processes p1, p2.
> 3. We know that all the data are discrete.
> Task:
> One: check whether data d1, d2 are equal to the reference data (dr) -
> this is not a problem. I use a cdf, a histogram, other measures etc.,
> and the chi-square test. But can I use Kolmogorov-Smirnov to test the
> cumulative distribution function hypothesis, i.e. F(d1) = F(d), for my
> data?
> Two: find the distribution of dr, discrete or, if possible, continuous.
> Best
> Marcin M.
