[R] normality test

Fri Apr 29 17:05:40 CEST 2005

On 28-Apr-05 Pieter Provoost wrote:
> Thanks all for your comments and hints. I will try to
> keep them in mind.
> Since a number of people asked me what I'm trying to do:
> I want to apply Bayesian inference to a simple ecological
> model I wrote, and therefore I need to fit (uniform, normal
> or lognormal) distributions to sets of observed data
> (to derive mean and sd). You probably have noticed that I'm
> quite new to statistics, but I'm working on that...
> 
> Pieter

And please continue to do so!

Let me try to be constructive. It is clearly established that
the data you posted are far from Normally distributed. The
simple qqnorm plot shows that immediately, and if you need it
the shapiro.test() with "p-value = 8.499e-11" settles it!

Going, however, a bit further, and looking at qqnorm(log(X))
(X being what I call your data series) suggests that it
departs systematically from a pure logNormal at least at the
6 highest values of X. And again, shapiro.test(log(X)) gives

  p-value = 0.00965

which is again fairly strong evidence, this time against logNormality.
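
As a minimal sketch of the two checks just described: your series X is
not reproduced in this message, so the code below uses a simulated
stand-in for it. The stand-in data and the names sw_raw and sw_log are
assumptions for illustration only, and the p-values it produces will
not match the ones quoted above.

```r
# Hypothetical stand-in for the data series X (NOT the real data).
set.seed(1)
X <- exp(rnorm(32, mean = -3, sd = 1))

# Normal Q-Q plots on the raw and on the log scale; a straight line
# would indicate a Normal (respectively logNormal) distribution.
qqnorm(X);      qqline(X)
qqnorm(log(X)); qqline(log(X))

# Shapiro-Wilk tests on both scales.
sw_raw <- shapiro.test(X)
sw_log <- shapiro.test(log(X))
c(raw = sw_raw$p.value, log = sw_log$p.value)
```

With the real data, substitute your own vector for X and drop the
set.seed() and simulation lines.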

Now, going back to your statement above, that you wrote a
"simple ecological model", I would like to know more about
that before proceeding further.

The rather clear break in slope in qqnorm(log(X)) suggests
to me the possibility that your data may represent a mixture
of two distinct, possibly though not necessarily logNormal,
distributions, one having a much longer upper tail than the
other but being a relatively small proportion (say 1/3).

For example, with X denoting your data, compare

  qqnorm(log(X))

with

  set.seed(52341)
  Y1 <- exp(rnorm(22, -3.26, 0.69))
  Y2 <- exp(rnorm(10, -1.75, 2.35))
  qqnorm(log(c(Y1, Y2)))

They are not dissimilar (and I have not been trying very hard).

Another thing to look at is simply

  hist(log(X), breaks = 0.5*(-12:4))

This also shows some interesting features: the very high peak
between -3.0 and -2.5 (and possibly an unduly high value between
-3.5 and -3.0), together with a rather thin and widely spread
upper tail above -2.0.

This could be quite consistent with the kind of mixture described
above, or could be due to observer error/bias in measurement.

In any case, it is clear that there is more than a simple
"(uniform, normal or lognormal)" distribution at play here.

In a real investigation, I would at this stage be concerned
to develop a realistic model of how the data are generated.

You do not say what these data represent.

The above was mostly written before you posted your second
email, explaining that

  "The Bayesian methods I (will) use are implemented in the
   modelling environment I'm using (FEMME). I'm supervised
   by the person that developed the environment, and she
   asked me to fit a normal or lognormal distribution to
   the observed data. The parameters of that distribution
   will then be used for the Bayesian analysis. So I suppose
   my supervisor knows what very well what she's doing, even
   though I don't (well... not yet)."

It may be speculated whether your supervisor has herself
seriously questioned the structure of these data, since what
she is asking you to do seems to presume that the above is
not relevant!

However, a mixture model would fit nicely into a Bayesian
framework, since (from the above) I suspect a simulation
or MCMC procedure will depend on the parameters to be
estimated for the distribution. For the mixture (e.g.
log(X) is a mixture of two normal distributions), you can
estimate the two parameters for each normal distribution
and the proportions p:(1-p) of each. Then, in sampling
from the mixture you first decide on component 1 with
probability p or component 2 with probability q = (1-p),
then sample from the corresponding lognormal distribution.
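
A rough sketch of how that might look in R, under stated assumptions:
the data are a simulated stand-in (the same simulated log values as in
the mixture example earlier in this message, not your X), the fit is a
plain maximum-likelihood one via optim(), and the names nll, est and
rmix are made up for illustration. Mixture likelihoods can have several
local maxima, so the starting values matter and the result should be
checked rather than trusted.

```r
# Negative log-likelihood of a two-component normal mixture for y = log(X).
# theta = (logit(p), mu1, log(sd1), mu2, log(sd2)); the transforms keep
# p inside (0, 1) and both standard deviations positive.
nll <- function(theta, y) {
  p    <- plogis(theta[1])
  dens <- p * dnorm(y, theta[2], exp(theta[3])) +
    (1 - p) * dnorm(y, theta[4], exp(theta[5]))
  -sum(log(dens))
}

# Hypothetical stand-in for log(X): simulated, NOT the real data.
set.seed(52341)
y <- c(rnorm(22, -3.26, 0.69), rnorm(10, -1.75, 2.35))

# Rough starting guesses (assumed), then a derivative-free search.
start <- c(qlogis(0.6), -3, log(0.7), -2, log(2))
fit   <- optim(start, nll, y = y, method = "Nelder-Mead",
               control = list(maxit = 2000))

# Back-transform to the natural parameters.
est <- c(p   = plogis(fit$par[1]),
         mu1 = fit$par[2], sd1 = exp(fit$par[3]),
         mu2 = fit$par[4], sd2 = exp(fit$par[5]))

# Sample from the fitted mixture: choose component 1 with probability p,
# otherwise component 2, then exponentiate to return to the X scale.
rmix <- function(n, est) {
  from1 <- runif(n) < est["p"]
  logx  <- ifelse(from1,
                  rnorm(n, est["mu1"], est["sd1"]),
                  rnorm(n, est["mu2"], est["sd2"]))
  exp(logx)
}
Xsim <- rmix(1000, est)
```

Comparing qqnorm(log(Xsim)) with qqnorm(log(X)) for the real data is one
quick way to judge whether the fitted mixture is at all plausible.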

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 29-Apr-05                                       Time: 15:41:36
------------------------------ XFMail ------------------------------