[R] Fwd: Help please

peter dalgaard pdalgd at gmail.com
Thu Jul 21 11:00:42 CEST 2011


On Jul 20, 2011, at 13:33 , Rolf Turner wrote:

> On 20/07/11 21:08, Simon Knapp wrote:
>> Hi All,
>> 
>> This is not really an R question but a statistical one. If someone could
>> either give me the brief explanation or point me to a reference that might
>> help, I'd appreciate it.
>> 
>> I want to estimate the mean of a log-normal distribution, given the (log
>> scale normal) parameters mu and sigma squared (sigma2). I understood this
>> should simply be:
>> 
>> exp(mu + sigma2)
> 
> I think you meant exp(mu + sigma2/2); that's what you used below
> (and that's what the right answer is).
>> ... but the following code gives me something strange:
>> 
>> R<- 10000000
>> mu<- -400
>> sigma2<- 200
>> tmp<- rlnorm(R, mu, sqrt(sigma2)) # a sample from the desired log-normal distribution
>> muh<- mean(log(tmp))
>> sigma2h<- var(log(tmp))
>> 
>> # by my understanding, all of the following vectors should then contain very similar numbers
>> c(mu, muh)
>> c(sigma2, sigma2h)
>> c(exp(mu + sigma2/2), exp(muh + sigma2h/2), mean(tmp))
>> 
>> 
>> I get the following (for one sample):
>>> c(mu, muh)
>> [1] -400.0000 -400.0231
>>> c(sigma2, sigma2h)
>> [1] 200.0000 199.5895
>>> c(exp(mu + sigma2/2), exp(muh + sigma2h/2), mean(tmp))
>> [1] 5.148200e-131 4.097249e-131 5.095888e-150
>> 
>> so they do all contain similar numbers, with the exception of the last
>> vector, which is out by a factor of 10^19. Is this likely to be because one
>> needs **very** large samples to get a reasonable estimate of the mean... or
>> am I missing something?
> This is in effect an FAQ.
> 
> You seem to be missing an understanding of floating point arithmetic.
> All of the last three values that you display are ***zero*** to all practical
> intents and purposes.

Umm, not quite, I think, Rolf. There is an aspect of FPU overflow in this, but it is in the variance: if the sdlog argument is sqrt(200), the larger deviations will swamp most other observations:

> exp(3*sqrt(200))
[1] 2.664124e+18


but there's no issue of observations actually underflowing to zero:

> range(rlnorm(1000000,-400,15))
[1] 7.456596e-205 1.409446e-141

and as long as you don't get close to the lower limit of representation at ~10^-308, things should be mostly independent of the meanlog value, as it is just a scale factor.
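
That scale-factor claim is easy to check (a hypothetical demonstration of mine, not part of the original exchange): with the same seed, draws with meanlog = -400 are exactly the meanlog = 0 draws times exp(-400).

```r
# Check (hypothetical, not in the thread) that meanlog is a pure scale
# factor: rlnorm(n, mu, s) equals exp(mu) * rlnorm(n, 0, s) draw for draw.
set.seed(42)
a <- rlnorm(5, meanlog = 0, sdlog = 2)
set.seed(42)
b <- rlnorm(5, meanlog = -400, sdlog = 2)
all.equal(b, exp(-400) * a)  # TRUE, as long as nothing underflows
```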

The extreme range in this case does make the variance computation suspect, but if you look at the actual distribution, it will be pretty extreme in any case. Even reducing sdlog to 1 will show the effect; try

> range(rlnorm(100000,0,1))
[1]  0.01236944 64.11340473

or do a hist(...) of the distribution. I.e., there _will_ be some rather extreme outliers unless the variance is quite small.
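
For instance, a few quantiles at sdlog = 1 (my own quick check) show the median sitting near 1 while the far tail is orders of magnitude larger:

```r
# Tail of the lognormal even at sdlog = 1 (an illustration, not from the post):
set.seed(7)
x <- rlnorm(1e5, meanlog = 0, sdlog = 1)
quantile(x, c(0.5, 0.99, 0.999))  # median near 1; upper tail much larger
max(x)
```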

The help page also has the formula for the variance of the lognormal, Var = exp(2*mu + sigma^2) * (exp(sigma^2) - 1), and if I plug in a mu of 0 and sigma^2=200, I get

> exp(200)*(exp(200)-1)
[1] 5.22147e+173

and a standard error of the mean of

> sqrt(exp(200)*(exp(200)-1)/1e7)
[1] 2.285054e+83

and the theoretical mean is 

> exp(100)
[1] 2.688117e+43

so the original poster is quite right that 10 million observations is far from sufficient to get a sensible estimate of the mean, even in the absence of FPU issues.
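
To put a number on "far from sufficient" (my own back-of-envelope, under the usual CLT approximation): the relative standard error of the sample mean is sqrt(exp(sigma2) - 1)/sqrt(n), since var/mean^2 = exp(sigma2) - 1 for the lognormal, so driving it down even to 100% needs n of roughly exp(sigma2) observations.

```r
# Back-of-envelope (my arithmetic, not from the post): relative SE of the
# sample mean is sqrt(exp(sigma2) - 1) / sqrt(n), so a 100% relative error
# already requires n of about exp(sigma2).
sigma2 <- 200
sigma2 / log(10)  # log10 of the required sample size: about 87
```

That is, about 10^87 observations, against the 10^7 actually drawn.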



-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


