[R] two-sample KS test: data becomes significantly different after normalization

Monnand monnand at gmail.com
Tue Jan 13 04:14:26 CET 2015


Thank you, Chris!

I think it is exactly the problem you mentioned. At first I did not
consider 1000 points to be a large sample.
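
A minimal sketch of this effect on simulated data (not the pastebin
data). It assumes the values are rounded and therefore contain many
ties, which has not been checked against the real data set; ks_d is a
hypothetical helper name. scale() subtracts each sample's own mean and
divides by its own SD, so the tied values of the two samples no longer
line up after normalization and D can jump to roughly the share of the
most common value.

```r
# Simulated stand-in data, NOT the data set from this thread.
# Assumption: values are rounded, so both samples contain many ties.
set.seed(42)
n <- 1000
x <- round(rnorm(n))
y <- round(rnorm(n))

# Hypothetical helper: two-sample KS statistic D as a plain number.
# Warnings about ties are suppressed because only D is used here.
ks_d <- function(a, b) {
  unname(suppressWarnings(ks.test(a, b))$statistic)
}

# scale() uses each sample's own mean and SD, so the two samples are
# shifted and stretched by slightly different amounts.
shift_x <- mean(x) / sd(x)
shift_y <- mean(y) / sd(y)

d_raw <- ks_d(x, y)
d_scaled <- ks_d(as.numeric(scale(x)), as.numeric(scale(y)))

cat("difference in shifts:", shift_x - shift_y, "\n")
cat("D on raw values:     ", d_raw, "\n")
cat("D after scale():     ", d_scaled, "\n")
```

With continuous data the same tiny shift would change D only slightly;
the jump in this sketch comes from the ties.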

I down-sampled the data from 1000 points to 100 points and ran the KS
test again. It worked as expected. Is there a standard method for
comparing two large samples? I also tried KL divergence, but it only
gives me a number; it does not tell me how large the divergence must be
before the samples should be considered significantly different.
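
One generic way to attach a p-value to an arbitrary distance is a
permutation test: pool the two samples, reshuffle the group labels many
times, and see how often the reshuffled distance is at least as large as
the observed one. The sketch below is only an illustration under stated
assumptions: the data are simulated, and kl_hist and perm_test are
hypothetical helper names, with a simple histogram-based KL estimate
standing in for whatever KL estimator is actually used.

```r
# Hypothetical helper: histogram-based KL divergence estimate between
# two samples, using shared bins over the pooled range.
kl_hist <- function(x, y, n_bins = 20) {
  breaks <- seq(min(x, y), max(x, y), length.out = n_bins + 1)
  # Add 0.5 to every bin count so that no bin has zero probability.
  p <- tabulate(findInterval(x, breaks, rightmost.closed = TRUE), n_bins) + 0.5
  q <- tabulate(findInterval(y, breaks, rightmost.closed = TRUE), n_bins) + 0.5
  p <- p / sum(p)
  q <- q / sum(q)
  sum(p * log(p / q))
}

# Hypothetical helper: permutation test for any two-sample statistic.
perm_test <- function(x, y, stat, n_perm = 199) {
  observed <- stat(x, y)
  pooled <- c(x, y)
  n_x <- length(x)
  perm_stats <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)  # random relabelling of the pool
    stat(pooled[idx], pooled[-idx])
  })
  # Add-one correction keeps the p-value strictly above zero.
  p_value <- (1 + sum(perm_stats >= observed)) / (n_perm + 1)
  list(statistic = observed, p.value = p_value)
}

# Simulated data, NOT the data set from this thread.
set.seed(1)
a <- rnorm(1000)
b <- rnorm(1000)               # same distribution as a
d <- rnorm(1000, mean = 0.5)   # shifted by half an SD

res_same <- perm_test(a, b, kl_hist)
res_shift <- perm_test(a, d, kl_hist)
cat("same distribution: KL =", res_same$statistic,
    " p =", res_same$p.value, "\n")
cat("shifted by 0.5 SD: KL =", res_shift$statistic,
    " p =", res_shift$p.value, "\n")
```

This calibrates the number, but like the KS test it will flag very
small differences once the samples are large, so the size of the
distance still has to be judged separately.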

Regards,
-Monnand

On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris <chrisaa at med.umich.edu> wrote:
>
> The main issue is that, although the original distributions are the same, you shift the two samples *by different amounts* (about 0.01 SD) and you have a large sample size (n=1000).  Thus the new distributions are not the same.
>
> This is a problem with testing for equality of distributions.  With large samples, even a small deviation is significant.
>
> Chris
>
> -----Original Message-----
> From: Monnand [mailto:monnand at gmail.com]
> Sent: Sunday, January 11, 2015 10:13 PM
> To: r-help at r-project.org
> Subject: [R] two-sample KS test: data becomes significantly different after normalization
>
> Hi all,
>
> This question is partly about R (I'm not sure whether I used an R function
> correctly) and partly about statistics in general. I'm sorry if this is
> considered off-topic.
>
> I'm currently working on a data set with two sets of samples. The csv file
> of the data can be found here: http://pastebin.com/200v10py
>
> I would like to use the KS test to see if these two sets of samples are from
> different distributions.
>
> I ran the following R script:
>
> # read data from the file
>> data = read.csv('data.csv')
>> ks.test(data[[1]], data[[2]])
>     Two-sample Kolmogorov-Smirnov test
>
> data:  data[[1]] and data[[2]]
> D = 0.025, p-value = 0.9132
> alternative hypothesis: two-sided
>
> The KS test shows that these two samples are very similar. (In fact, they
> should come from the same distribution.)
>
> However, for various reasons, instead of the raw values, the actual data
> that I will get will be normalized (zero mean, unit variance). So I tried
> to normalize the raw data I have and run the KS test again:
>
>> ks.test(scale(data[[1]]), scale(data[[2]]))
>     Two-sample Kolmogorov-Smirnov test
>
> data:  scale(data[[1]]) and scale(data[[2]])
> D = 0.3273, p-value < 2.2e-16
> alternative hypothesis: two-sided
>
> The p-value becomes almost zero after normalization, indicating that these
> two samples are significantly different (from different distributions).
>
> My question is: how could normalization make two similar samples
> become different from each other? I can see that if two samples are
> different, then normalization could make them similar. However, if two sets
> of data are similar, then intuitively, applying the same operation to both
> should leave them similar, or at least not very different from each
> other.
>
> I did some further analysis of the data. I also tried to normalize the
> data into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))), but
> the same thing happened. At first, I thought outliers might have caused this
> problem (I can see that an outlier may cause this problem if I normalize
> the data into the [0,1] range). I deleted all data points whose absolute value
> is larger than 4 standard deviations, but it still didn't help.
>
> I also plotted the eCDFs, and they *really* look the same to me even
> after normalization. Is anything wrong with my usage of the R function?
>
> Since the data contains ties, I also tried ks.boot (
> http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same
> result.
>
> Could anyone help me understand why this happened? Also, are there any
> suggestions about hypothesis testing on normalized data? (The data I have
> right now is simulated. In the real world, I cannot get the raw data, only
> the normalized version.)
>
> Regards,
> -Monnand
>


