[R] Test if data uniformly distributed (newbie)
Petr Savicky
savicky at praha1.ff.cuni.cz
Sun Jun 12 10:40:37 CEST 2011
On Fri, Jun 10, 2011 at 10:15:36PM +0200, Kairavi Bhakta wrote:
> Thanks for your answer. The reason I want the data to be uniform: It's the
> first step in a machine learning project I am working on. If I know the data
> isn't uniformly distributed, then this means there is probably something
> wrong and the following steps will be biased by the non-uniform input data.
> I'm not checking an assumption for another statistical test.
>
> Actually, the data has been normalized because it is supposed to represent a
> probability distribution. That's why it sums to 1. My assumption is that,
> for a vector of 5, the data at that point should look like 0.20 0.20 0.20
> 0.20 0.20, but of course there is variation, and I would like to test
> whether the data comes close enough or not.

As others told you, this is not the right format for the KS test. The words
"testing uniformity" can mean different things, and the meaning depends
on which statistical model you assume. If we have a random variable
with values in [0, 1], then testing uniformity means testing to what
extent its distribution is close to the uniform distribution on [0, 1].
Numbers that concentrate around 0.2 will not satisfy this.
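
To illustrate this, here is a minimal sketch with made-up values near 0.2
(hypothetical numbers, not your data), tested against the uniform
distribution on [0, 1] with ks.test:

```r
# Hypothetical normalized values concentrated around 0.2 (illustration only)
p <- c(0.21, 0.18, 0.22, 0.19, 0.20)

# One-sample Kolmogorov-Smirnov test against the uniform distribution on [0, 1]
res <- ks.test(p, "punif")

res$statistic  # D = 0.78: the empirical CDF reaches 1 already at 0.22
res$p.value    # small, so uniformity on [0, 1] is rejected
```

The values are very far from uniform on [0, 1], even though they are
nearly equal to each other, which is why the KS test does not answer
the question you have in mind.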

If we have a discrete variable with k values, for which we have m
independent observations, and the number of observations of value i
is m_i, then it is possible to test whether the variable has the uniform
distribution on {1, ..., k} using the Chi-squared test. Note that this
test needs the original counts, not their normalized values, which sum
to 1. For example, if we have 20 observations and
the counts (m_1, ..., m_5) are (4, 3, 5, 2, 6), then this is quite
consistent with the assumption of a uniform distribution. On the
other hand, if we have 200 observations and the counts are
(40, 30, 50, 20, 60), then the null hypothesis of a uniform distribution
is rejected at the usual significance levels (the uniform distribution
is the default, see argument p in ?chisq.test):

  x <- c(40, 30, 50, 20, 60)
  chisq.test(x)

          Chi-squared test for given probabilities

  data:  x
  X-squared = 25, df = 4, p-value = 5.031e-05

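
The effect of the sample size can be sketched by running the test on the
20-observation counts, on the same proportions with 200 observations, and
on the normalized values. The variable names are only for illustration;
suppressWarnings is used because R warns that the Chi-squared approximation
may be incorrect when the expected counts are below 5.

```r
small <- c(4, 3, 5, 2, 6)   # 20 observations
large <- 10 * small         # 200 observations, same proportions

# Expected count per cell is 4 for 'small', so R warns about the approximation
res_small <- suppressWarnings(chisq.test(small))
res_large <- chisq.test(large)

res_small$statistic  # X-squared = 2.5
res_small$p.value    # about 0.64: consistent with uniformity
res_large$statistic  # X-squared = 25
res_large$p.value    # about 5e-05: uniformity rejected

# Normalized values lose the sample size, so the result is not meaningful
res_norm <- suppressWarnings(chisq.test(small / sum(small)))
res_norm$statistic   # X-squared = 0.125, whatever the original m was
```

For small counts, the argument simulate.p.value = TRUE described in
?chisq.test is one way to avoid relying on the Chi-squared approximation.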
It is not clear whether this is suitable for your application.
If you generate the values in a different way, then another
test may be needed. Can you give more detail on how the
numbers are generated?

Petr Savicky.