[R] A somewhat off the line question to a log normal distrib

Thu Dec 2 12:30:01 CET 2004

On 02-Dec-04 Siegfried Gonzi wrote:
> Hello:
> 
> Oh yes I know it isn't so much related to R, but I gather
> there are a lot of statisticians reading the mailing list.
> 
> My boss repeatedly tried to explain the following to me.
> 
> ==
> Let's assume you have got daily measurements of a variable
> in natural sciences. It turned out that the aforementioned
> daily measurements follow a log-normal distribution when
> considered over the course of a year.
> Okay. He also tried to explain to me that the monthly means
> (based on the daily measurements) must then follow a log-normal
> distribution too over the course of a year.
> ==
> 
> I somehow get his explanation.

Hmm, perhaps you should think again! If X and Y have log-normal
distributions (mathematically exactly), then (X+Y)/2 does not
(mathematically) have a log-normal distribution -- still less
the arithmetic mean of some 30 such variables. So one wonders
what the basis of his "explanation" was.

However, the conclusion would hold for the *geometric* mean
of the variables. X has a log-normal distribution if log(X)
has a normal distribution. So let X, Y, ... be log-normal.
The geometric mean is exp((log(X)+log(Y)+...)/n). Since
log(X), log(Y), ... are normal, so is (log(X)+log(Y)+...)/n
(provided they are independent, or at least jointly normal),
and so the geometric mean is log-normal.
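
The geometric-mean argument can be checked numerically. The following
is only a sketch under assumed values: n=30 independent values per
"month", log-mean mu=0 and log-sd sigma=1 (none of these come from your
data). If the argument holds, the log of the geometric mean should be
normal with mean mu and standard deviation sigma/sqrt(n).

```r
set.seed(1)                       # assumed seed, for reproducibility only
n     <- 30                       # assumed number of daily values per mean
reps  <- 2000                     # number of simulated "months"
mu    <- 0                        # assumed mean of log(X)
sigma <- 1                        # assumed sd of log(X)

# One column per simulated month, n log-normal values in each
X <- matrix(exp(mu + sigma * rnorm(n * reps)), nrow = n)

G <- exp(colMeans(log(X)))        # geometric mean of each column
A <- colMeans(X)                  # arithmetic mean, for comparison

# log(G) should have mean mu and sd sigma/sqrt(n)
print(c(mean_logG = mean(log(G)),
        sd_logG   = sd(log(G)),
        theory_sd = sigma / sqrt(n)))
```

Following this with qqnorm(log(G)) and qqnorm(log(A)) gives a visual
comparison of the geometric and arithmetic means on the log scale.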

> But I have measurements which are log-normally distributed
> when evaluated on a daily basis over the course of a year
> but they are close to a Gaussian distribution when considered
> in the light of monthly means over the course of a year.
> 
> Is such a latter case feasible? And if not, why?

This is, broadly, to be expected. If X1, X2, ... are independent,
with similar means and finite variances, then regardless of their
precise distributions the distribution of (X1+X2+...+Xn)/n
approaches the normal distribution as n->infinity ("Central
Limit Theorem").

How rapidly this happens depends on how much the distributions
of X1,... differ from a normal distribution. One feature which
can cause the approach to "normal" to be slow is skewness: the
more skew the distribution of each X1, ... , the slower the
convergence. The log-normal distribution is positively skewed,
sometimes grossly so -- experiment on the lines of:

  X<-exp(0+1.0*rnorm(10000)); hist(X,n=100)
  X<-exp(0+0.8*rnorm(10000)); hist(X,n=100)
  X<-exp(0+0.6*rnorm(10000)); hist(X,n=100)
  X<-exp(0+0.4*rnorm(10000)); hist(X,n=100)
  X<-exp(0+0.3*rnorm(10000)); hist(X,n=100)
  X<-exp(0+0.2*rnorm(10000)); hist(X,n=100)
  X<-exp(0+0.1*rnorm(10000)); hist(X,n=100)
  X<-exp(1+1.0*rnorm(10000)); hist(X,n=100)
  X<-exp(1+0.8*rnorm(10000)); hist(X,n=100)
  X<-exp(1+0.7*rnorm(10000)); hist(X,n=100)
  X<-exp(1+0.6*rnorm(10000)); hist(X,n=100)
  X<-exp(1+0.4*rnorm(10000)); hist(X,n=100)
  X<-exp(1+0.2*rnorm(10000)); hist(X,n=100)
  X<-exp(1+0.1*rnorm(10000)); hist(X,n=100)
  X<-exp(1+0.05*rnorm(10000)); hist(X,n=100)

(hoping this brings the query "on-topic") to get an impression
of the variety. A few of these look approximately normal as
they stand; the majority do not.
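
For a more quantitative impression, the skewness of a log-normal
distribution depends only on the standard deviation sigma of log(X),
not on its mean: it is (exp(sigma^2)+2)*sqrt(exp(sigma^2)-1). The
skewness of the mean of n independent values is the individual
skewness divided by sqrt(n). The sketch below tabulates both; the
helper name lnorm_skew and the choice n=30 (roughly one month of daily
values) are mine, purely for illustration.

```r
# Theoretical skewness of a log-normal whose log has sd sigma
# (hypothetical helper; the mean of log(X) does not enter)
lnorm_skew <- function(sigma) {
  (exp(sigma^2) + 2) * sqrt(exp(sigma^2) - 1)
}

sigmas <- c(1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05)
n      <- 30                      # assumed number of values per mean

tab <- data.frame(
  sigma       = sigmas,
  skew_single = lnorm_skew(sigmas),           # one observation
  skew_mean   = lnorm_skew(sigmas) / sqrt(n)  # mean of n independent obs
)
print(round(tab, 3))
```

Since the skewness does not depend on the mean of log(X), the two
series of histograms above (with 0 and with 1 inside the exp) differ
in scale but not in shape for the same sigma.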

As for exploring the "central limit" tendency, you can try
things like

  N<-500;X<-exp(0+1.0*rnorm(N*1000)); Y<-matrix(X,nrow=N);
     M<-colMeans(Y);hist(M,n=20)
  hist(X,n=100)

[The first line draws a histogram of 1000 means, each of
 N=500 log-normal variates. The second shows a histogram
 of the original N*1000 variates, allowing you to compare
 the two and perceive the extent to which the approach
 to a normal distribution has been achieved. In this case,
 the means still have a perceptibly skew distribution,
 and of course the original data were very heavily skewed.
 You can evaluate the results for less skew log-normals
 in a similar way, building on the information from
 the first series of experiments.

 This may have been a consideration underlying your boss's
 argument: If the original data are heavily skew, then
 the distribution of the monthly means may well still be
 quite skew and better described by a log-normal than by
 a normal. However, your observation that the monthly means
 seem to be close to a normal distribution suggests that
 this was not the case: probably the original data, though
 log-normal, were not so skew that means of N=30 or so values
 were still perceptibly non-normal. (As stated above,
 as N -> infinity, you will eventually get a normal.)]
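
A variant closer to the monthly-means situation takes N=30 and
compares several values of the log-scale standard deviation. This is
only a sketch: the sigma values, the number of replicates and the
helper skew() are my own choices, and the "daily" values are taken to
be independent, which real daily measurements may well not be.

```r
set.seed(42)                      # assumed seed, for reproducibility only
N    <- 30                        # values per "monthly" mean
reps <- 1000                      # number of simulated means per sigma

# Sample skewness (hypothetical helper, moment-based)
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

sigmas <- c(1.0, 0.5, 0.2)        # assumed log-scale standard deviations
res <- sapply(sigmas, function(s) {
  Y <- matrix(exp(0 + s * rnorm(N * reps)), nrow = N)
  skew(colMeans(Y))               # skewness of the N-value means
})
names(res) <- sigmas
print(round(res, 2))
```

A sample skewness near zero for the means is consistent with the
near-normal monthly means you describe; hist() or qqnorm() on
colMeans(Y) shows the same thing graphically.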

So you can use R usefully to evaluate general statistical
issues of this kind!

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 02-Dec-04                                       Time: 11:30:01
------------------------------ XFMail ------------------------------