# [R] two-sample KS test: data becomes significantly different after normalization

Martin Maechler maechler at stat.math.ethz.ch
Wed Jan 14 11:27:10 CET 2015

>>>>> Monnand  <monnand at gmail.com>
>>>>>     on Wed, 14 Jan 2015 07:17:02 +0000 writes:

> I know this must be the wrong method, but I cannot help asking: can I use
> only the p-value from the KS test, saying that if the p-value is greater than
> \beta, then the two samples are from the same distribution? If the definition
> of the p-value is the probability that the null hypothesis is true,

Ouch, ouch, ouch, ouch !!!!!!!!

The worst misuse/misunderstanding of statistics, now even on R-help ...

---> please get help from a statistician !!

--> and erase that sentence from your mind (unless you are a pro
and want to keep it for anecdotal or didactic purposes...)
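
A small simulation may make the point concrete. A p-value is computed
*assuming* the null hypothesis is true: it is the probability of data at
least as extreme as those observed under H0, not the probability that H0
holds. In the sketch below, the sample sizes, the 0.2 SD shift and the number
of replications are arbitrary choices for illustration only.

```r
set.seed(1)
n_sim <- 2000  # number of simulated experiments (arbitrary)

# Case 1: H0 is TRUE by construction (both samples from N(0, 1)).
p_null <- replicate(n_sim, ks.test(rnorm(100), rnorm(100))$p.value)

# Case 2: H0 is FALSE by construction (second sample shifted by 0.2 SD),
# but the samples are small, so the test has little power.
p_alt <- replicate(n_sim, ks.test(rnorm(30), rnorm(30, mean = 0.2))$p.value)

# Fraction of experiments with p > 0.05 in each case
frac_null_large <- mean(p_null > 0.05)
frac_alt_large  <- mean(p_alt  > 0.05)
print(c(H0_true = frac_null_large, H0_false = frac_alt_large))
```

In both cases most p-values exceed 0.05, although H0 is true in the first and
false in the second. So "p > threshold" cannot be read as "the two samples
are from the same distribution".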

> then why do so few
> people use the p-value as a "true" probability? E.g., normally, people will not
> multiply or add p-values to get the probability that two independent null
> hypotheses are both true or that one of them is true. I have had this question
> for a very long time.

> -Monnand

> On Tue Jan 13 2015 at 2:47:30 PM Andrews, Chris <chrisaa at med.umich.edu>
> wrote:

>> This sounds more like quality control than hypothesis testing.  Rather
>> than statistical significance, you want to determine what is an acceptable
>> difference (an 'equivalence margin', if you will).  And that is a question
>> about the application, not a statistical one.
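
Chris's suggestion could be sketched roughly as follows: choose a margin for
the KS distance D that is acceptable for the application, and compare the
observed D to it instead of looking at the p-value. The margin of 0.10 and
the simulated data are purely hypothetical stand-ins; a careful analysis would
also account for the sampling uncertainty of D, which this sketch ignores.

```r
set.seed(42)
x <- rnorm(1000)               # simulated stand-in for the first sample
y <- rnorm(1000, mean = 0.05)  # second sample, deliberately shifted a little

margin <- 0.10  # hypothetical equivalence margin on the KS distance D

res   <- ks.test(x, y)
d_obs <- unname(res$statistic)  # observed maximal eCDF difference

cat("D =", round(d_obs, 3), " p-value =", signif(res$p.value, 3), "\n")
cat("within margin:", d_obs <= margin, "\n")
```

The decision then rests on whether a difference of that size matters in
practice, which only the application can answer.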
>> ________________________________________
>> From: Monnand [monnand at gmail.com]
>> Sent: Monday, January 12, 2015 10:14 PM
>> To: Andrews, Chris
>> Cc: r-help at r-project.org
>> Subject: Re: [R] two-sample KS test: data becomes significantly different
>> after normalization
>>
>> Thank you, Chris!
>>
>> I think it is exactly the problem you mentioned. I did consider
>> 1000-point data is a large one at first.
>>
>> I down-sampled the data from 1000 points to 100 points and ran the KS test
>> again. It worked as expected. Is there any typical method to compare
>> two large samples? I also tried KL divergence, but it only gives me a
>> number and does not tell me how large the distance should be to be
>> considered significantly different.
>>
>> Regards,
>> -Monnand
>>
>> On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris <chrisaa at med.umich.edu>
>> wrote:
>> >
>> > The main issue is that the original distributions are the same, you
>> > shift the two samples *by different amounts* (about 0.01 SD), and you have
>> > a large (n=1000) sample size.  Thus the new distributions are not the same.
>> >
>> > This is a problem with testing for equality of distributions.  With
>> > large samples, even a small deviation is significant.
>> >
>> > Chris
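
The effect of sample size that Chris describes can be seen in a minimal
simulation. The shift of 0.1 SD and the sample sizes are assumptions chosen
for illustration, not values from the thread's data.

```r
set.seed(123)
shift <- 0.1                          # hypothetical shift, in SD units
sizes <- c(100, 1000, 10000, 100000)  # per-sample sizes to compare

# Same small shift every time; only the sample size changes.
p_by_n <- sapply(sizes, function(n) {
  ks.test(rnorm(n), rnorm(n, mean = shift))$p.value
})
names(p_by_n) <- sizes
print(signif(p_by_n, 3))
```

The underlying difference between the two distributions is identical in every
run, yet the p-value tends to shrink as n grows, so with enough data even a
practically irrelevant shift is declared significant.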
>> >
>> > -----Original Message-----
>> > From: Monnand [mailto:monnand at gmail.com]
>> > Sent: Sunday, January 11, 2015 10:13 PM
>> > To: r-help at r-project.org
>> > Subject: [R] two-sample KS test: data becomes significantly different
>> > after normalization
>> >
>> > Hi all,
>> >
>> > This question is partly about R (I'm not sure whether I used an R function
>> > correctly), but also about statistics in general. I'm sorry if this is
>> > considered off-topic.
>> >
>> > I'm currently working on a data set with two sets of samples. The CSV file
>> > of the data can be found here: http://pastebin.com/200v10py
>> >
>> > I would like to use the KS test to see if these two sets of samples are from
>> > different distributions.
>> >
>> > I ran the following R script:
>> >
>> > # read data from the file
>> >> ks.test(data[[1]], data[[2]])
>> >     Two-sample Kolmogorov-Smirnov test
>> >
>> > data:  data[[1]] and data[[2]]
>> > D = 0.025, p-value = 0.9132
>> > alternative hypothesis: two-sided
>> >
>> > The KS test shows that these two samples are very similar. (In fact, they
>> > should come from the same distribution.)
>> >
>> > However, for various reasons, the actual data that I will get will be
>> > normalized (zero mean, unit variance) rather than raw values. So I tried
>> > to normalize the raw data I have and ran the KS test again:
>> >
>> >> ks.test(scale(data[[1]]), scale(data[[2]]))
>> >     Two-sample Kolmogorov-Smirnov test
>> >
>> > data:  scale(data[[1]]) and scale(data[[2]])
>> > D = 0.3273, p-value < 2.2e-16
>> > alternative hypothesis: two-sided
>> >
>> > The p-value becomes almost zero after normalization, indicating that these
>> > two samples are significantly different (from different distributions).
>> >
>> > My question is: how could normalization make two similar samples
>> > become different from each other? I can see that if two samples are
>> > different, then normalization could make them similar. However, if two sets
>> > of data are similar, then intuitively, applying the same operation to them
>> > should leave them similar, or at least not too different from each
>> > other.
>> >
>> > I did some further analysis of the data. I also tried normalizing the
>> > data into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))), but
>> > the same thing happened. At first, I thought outliers might have caused this
>> > problem (I can see that an outlier may cause this problem if I normalize
>> > the data into the [0,1] range). I deleted all data points whose absolute value
>> > is larger than 4 standard deviations, but it still didn't help.
>> >
>> > Plus, I even plotted the eCDFs, and they *really* look the same to me even
>> > after normalization. Is anything wrong with my usage of the R function?
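
One way to reconcile the plotted eCDFs with what ks.test() reports is to
compute D by hand and see where the largest gap sits. The sketch below uses
simulated data as a stand-in (an assumption, since the posted data are not
reproduced here); with the real data, data[[1]] and data[[2]] would replace
x and y.

```r
set.seed(7)
x <- rnorm(1000)  # simulated stand-in for data[[1]]
y <- rnorm(1000)  # simulated stand-in for data[[2]]

# Normalize each sample separately, as scale() does (zero mean, unit variance)
xs <- as.numeric(scale(x))
ys <- as.numeric(scale(y))

# D is the largest vertical gap between the two eCDFs; it is attained at one
# of the pooled observations, so evaluating there is enough.
grid <- sort(unique(c(xs, ys)))
gap  <- abs(ecdf(xs)(grid) - ecdf(ys)(grid))

d_manual  <- max(gap)
where_max <- grid[which.max(gap)]  # location of the largest gap
d_ks      <- unname(ks.test(xs, ys)$statistic)

cat("manual D =", d_manual, " ks.test D =", d_ks, " at x =", where_max, "\n")
```

If the hand-computed D agrees with ks.test() but the curves look identical,
zooming the eCDF plot around the reported location shows whether the gap is
real.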
>> >
>> > Since the data contains ties, I also tried ks.boot (
>> > http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same
>> > result.
>> >
>> > Could anyone help me to explain why it happened? Also, any suggestion