[R] Right shift for normality
(Ted Harding)
Ted.Harding at nessie.mcc.ac.uk
Tue Mar 30 02:27:58 CEST 2004
On 29-Mar-04 gwiggner at lix.polytechnique.fr wrote:
> Hello,
>
> My data is discrete, taking values between around -5 and +5.
> I cannot give bounds for the values. So I consider it as
> numerical data and not categorical data.
>
> The histogram has a 'normal' shape, so I test for normality
> via a chisquare statistic (by calculating the expected
> values by hand).
>
> When I use the sample mean and variance, the normality hypothesis
> has to be rejected.
> But when I test for sample mean + a small epsilon, I get very high
> p-values.
>
> I am not sure if this right shift is a good idea.
> Any suggestions?

I suspect that what you are seeing here corresponds to the following.
Because your data are discrete, you are treating them as though
they are "binned" values of an underlying continuous distribution
when you approach the goodness-of-fit using a chisquared measure.
At the same time, because you are using the sample mean and variance
to estimate the parameters of this distribution, you are behaving
as though the discrete values are the exact values of the continuous
variable.
To be consistent, if you treat the observed values as "binned" values,
the estimates you use for the mean and variance of the underlying
normal distribution should take account of the grouping.
There are two main approaches you could adopt here.

1. Minimum-chisquared: The chisquared value is the sum of (O-E)^2/E
   over the bins, where O is the observed count in a bin and each E
   is calculated as n*(integral of the normal density over the range
   of that bin), n being the total count.
   Minimise this with respect to mu and sigma^2.

2. Maximum likelihood: The likelihood is the product over the bins
   of P^r, where P is the integral of the normal density over the
   range of the bin and r is the observed count in that bin.
   Maximise this with respect to mu and sigma^2.
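
The two approaches above can be sketched in a few lines. This is only
an illustrative sketch under stated assumptions, not a definitive
implementation: it assumes unit-width bins centred on the integers
-5..5, lets the two end bins absorb the tails, and uses made-up counts
generated from a hypothetical normal with mu = 0.3 and sigma = 1.5.
The names bin_probs, chisq, neg_loglik and minimise are hypothetical
helpers, and the crude compass search merely stands in for a proper
optimiser. It is written in Python using the standard library only.

```python
from math import log, sqrt
from statistics import NormalDist

def bin_probs(values, mu, sigma, width=1.0):
    """Normal probability of each bin; the two end bins absorb the tails."""
    nd = NormalDist(mu, sigma)
    last = len(values) - 1
    probs = []
    for i, v in enumerate(values):
        lo = nd.cdf(v - width / 2) if i > 0 else 0.0
        hi = nd.cdf(v + width / 2) if i < last else 1.0
        probs.append(max(hi - lo, 1e-300))  # guard against underflow to zero
    return probs

def chisq(values, counts, mu, sigma):
    """Pearson statistic: sum of (O - E)^2 / E with E = n * P(bin)."""
    n = sum(counts)
    probs = bin_probs(values, mu, sigma)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, probs))

def neg_loglik(values, counts, mu, sigma):
    """Minus the log of the product of P^r over the bins."""
    probs = bin_probs(values, mu, sigma)
    return -sum(r * log(p) for r, p in zip(counts, probs))

def minimise(f, mu0, sigma0, step=0.25, tol=1e-6):
    """Crude compass search over (mu, sigma); a stand-in for a real optimiser."""
    x, fx = [mu0, sigma0], f(mu0, sigma0)
    while step > tol:
        improved = False
        for i in (0, 1):
            for d in (step, -step):
                y = list(x)
                y[i] += d
                if y[1] <= 0:  # sigma must stay positive
                    continue
                fy = f(*y)
                if fy < fx:
                    x, fx, improved = y, fy, True
        if not improved:
            step /= 2
    return x[0], x[1], fx

# Hypothetical data: counts close to what a binned N(0.3, 1.5^2) would give.
values = list(range(-5, 6))
counts = [round(10000 * p) for p in bin_probs(values, 0.3, 1.5)]
n = sum(counts)

# Naive estimates: treat the discrete values as exact observations.
xbar = sum(v * c for v, c in zip(values, counts)) / n
s = sqrt(sum(c * (v - xbar) ** 2 for v, c in zip(values, counts)) / (n - 1))

# Grouping-aware estimates, both searches started from the naive ones.
mu_c, sigma_c, chi_min = minimise(
    lambda m, sd: chisq(values, counts, m, sd), xbar, s)
mu_l, sigma_l, nll_min = minimise(
    lambda m, sd: neg_loglik(values, counts, m, sd), xbar, s)

print("sample moments :", xbar, s, chisq(values, counts, xbar, s))
print("min-chisquared :", mu_c, sigma_c, chi_min)
print("max-likelihood :", mu_l, sigma_l, chisq(values, counts, mu_l, sigma_l))
```

Comparing the three printed lines shows how far the sample mean and
standard deviation sit from the grouping-aware estimates, and how much
chisquared drops once the grouping is taken into account. With real
data, replace values and counts by the observed table.
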
Neither of these estimates will give exactly the sample mean
and sample variance as estimates of mean and variance of the
underlying distribution.
Therefore, if the data you have do in fact correspond very closely
to binned values from a normal distribution, the fit you get by
using sample mean and sample variance as estimates will not be
the best fit, and (if you have enough data) the discrepancy may
well be big enough to give a significantly large chisquared.
But it could be (as you appear to have observed) that simply
shifting the sample mean gives you a fit which is closer to
the fit you would get from (1) or (2) (though I would also have
expected it to improve if you slightly reduced sigma^2 as well).
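
The expectation that a slightly smaller sigma^2 should help agrees
with the classical Sheppard correction for grouping, which says that
values grouped into bins of width h have a variance inflated by about
h^2/12. The sketch below checks this numerically for a hypothetical
normal (mu = 0.3, sigma = 1.5, h = 1; these numbers are assumptions
for illustration, not taken from the data discussed here), again in
Python with the standard library only.

```python
from statistics import NormalDist

# Hypothetical underlying distribution and bin width (illustrative only).
mu, sigma, h = 0.3, 1.5, 1.0
nd = NormalDist(mu, sigma)

# Probability that a normal value is rounded to the integer k,
# i.e. falls in the bin (k - h/2, k + h/2). The range -30..30 covers
# essentially all of the probability mass for these parameters.
ks = range(-30, 31)
p = [nd.cdf(k + h / 2) - nd.cdf(k - h / 2) for k in ks]

# Mean and variance of the rounded (binned) variable.
mean_b = sum(k * pk for k, pk in zip(ks, p))
var_b = sum((k - mean_b) ** 2 * pk for k, pk in zip(ks, p))

print("underlying variance :", sigma ** 2)
print("binned variance     :", var_b)
print("excess vs h^2/12    :", var_b - sigma ** 2, h ** 2 / 12)
```

So a sample variance computed from the discrete values tends to
overstate sigma^2 by roughly 1/12 when the bins have unit width, while
the mean is almost unaffected in this symmetric setting.
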

There is a nice paper from quite long ago by Dennis Lindley
which discusses very closely related issues:

  Lindley, D.V. (1950). Grouping corrections and maximum likelihood
  equations. Proceedings of the Cambridge Philosophical Society,
  vol. 46, 106-110.

Best wishes,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 30-Mar-04 Time: 00:27:58
------------------------------ XFMail ------------------------------