[R] options in 'rnorm' to set the lower bound of normal dist

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Mar 27 16:12:51 CET 2008


On Thu, 27 Mar 2008, Ted.Harding at manchester.ac.uk wrote:

> Brian Ripley's suggestions of truncated normal and
> log-normal are of course resources for ensuring that
> you get positive simulated results.

I was answering the question asked!

Wanting to simulate from the fitted mean and sd does *not* mean you want 
to emulate the data -- indeed, it is often done precisely to see whether 
peculiar aspects of the data have a noticeable effect on the rest of the 
analysis.

I've done this with analytical chemist colleagues several times, although 
I did point out it was better to use robust summaries.
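
If you do want to keep the fitted mean and sd but avoid negative values, 
a minimal sketch of the two options might look like this (rtnorm0 and 
rlnorm_match are just names used here for illustration; mu and sigma 
stand for a fitted mean and sd):

  ## Normal truncated at zero via the inverse-CDF method.  Note the
  ## truncated draws no longer have exactly mean mu and sd sigma.
  rtnorm0 <- function(n, mu, sigma) {
    p0 <- pnorm(0, mean = mu, sd = sigma)      # probability mass below zero
    qnorm(runif(n, min = p0, max = 1), mean = mu, sd = sigma)
  }

  ## Log-normal with the same mean and sd on the original scale,
  ## using the log-normal moment formulas to solve for meanlog/sdlog.
  rlnorm_match <- function(n, mu, sigma) {
    sdlog   <- sqrt(log(1 + (sigma/mu)^2))
    meanlog <- log(mu) - sdlog^2/2
    rlnorm(n, meanlog = meanlog, sdlog = sdlog)
  }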

> However, I think your real problem (having looked at
> the numbers you quote) is that you should not be thinking
> of using a normal distribution, or anything similar, at all.
>
> Your variables dat$x and dat$y have:
>   mean(dat$x)
> ## [1] 0.3126667
>   sd(dat$x)
> ## [1] 0.3372137
>
>   mean(dat$y)
> ## [1] 0.4223137
>   sd(dat$y)
> ## [1] 0.5120841
>
> so in both cases the SD is about the same as the mean, and
> getting negative values from simulated normal distributions
> with these means and SDs is inevitable.
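>
> (A quick check of just how often, using the dat posted below:
>
>   pnorm(0, mean = mean(dat$x), sd = sd(dat$x))
>   pnorm(0, mean = mean(dat$y), sd = sd(dat$y))
>
> gives roughly 0.18 and 0.20, i.e. about one draw in five below zero.)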
>
> But now look at the two series of 15 numbers in dat$x and dat$y:
> The first 12 are of the order of 0.15 and 0.20 respectively,
> while the final 3 in each case are of the order of 1.0 and 1.5
> respectively. And this is where your large SD is coming from.
> Neither series is from a simple normal distribution.
>
>   mean(dat$x[1:12])
> ## [1] 0.1519167
>   sd(dat$x[1:12])
> ## [1] 0.02447432
> and it is impossible for the last 3 (1.032, 0.803, 1.032), each more
> than 25 SDs above that mean, to have come from the normal distribution
> that fits the first 12.
>
>   mean(dat$y[1:12])
> ## [1] 0.1807932
>   sd(dat$y[1:12])
> ## [1] 0.03380083
> with a similar conclusion; and likewise for the last three
> 1.551043 1.063100 1.551043
>
> Note that, for the first 12 in each case, the SD is less than
> 1/5 of the mean:
>
>   mean(dat$x[1:12])/sd(dat$x[1:12])
> ## [1] 6.207186
>   mean(dat$y[1:12])/sd(dat$y[1:12])
> ## [1] 5.348779
>
> so, where the first 12 of x and y are concerned, if you sampled
> from normal distributions with the same means and SDs you would
> get a negative number with probability less than
>
>   pnorm(-5)
> ## [1] 2.866516e-07
>
> What you in fact have here is that the numbers are in two groups,
> each with a small SD relative to its mean:
>
>             dat$x             dat$y
>          Mean     SD       Mean     SD
> ------------------------+-------------------
>  1:12   0.152   0.024   |  0.181   0.034
>                         |
> 13:15   0.956   0.132   |  1.39    0.282
>
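> (The group summaries above can be reproduced with something like
>
>   grp <- rep(1:2, c(12, 3))
>   sapply(split(dat$x, grp), mean); sapply(split(dat$x, grp), sd)
>   sapply(split(dat$y, grp), mean); sapply(split(dat$y, grp), sd)
>
> where grp simply marks the observed 12/3 split.)
>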
> Note that for dat$x Mean/SD approx = 6 for each sub-series,
> and for dat$y Mean/SD approx = 5 for each sub-series, so
> you could be looking at results which display a nearly
> constant coefficient of variation. Now, this is indeed a
> property of the log-normal distribution (as well as of
> others), so that could indeed be worth considering.
> However, you still have to account for the apparent split
> noted above into distinct groups.
>
> So you are really facing a modelling question: why did
> the numbers come out as they did, and what is a good way
> to represent that mechanism as a distribution?
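>
> For what it's worth, one simple way to respect the two-group structure
> in a simulation (a sketch only; the 12/3 split is just the one observed
> above) is to handle the groups separately:
>
>   grp <- rep(1:2, c(12, 3))
>   sim_x <- unlist(lapply(split(dat$x, grp), function(g)
>                     rnorm(length(g), mean = mean(g), sd = sd(g))))
>   sim_y <- unlist(lapply(split(dat$y, grp), function(g)
>                     rnorm(length(g), mean = mean(g), sd = sd(g))))
>
> With Mean/SD of 5 or more within each group, negative draws are then
> vanishingly rare, but the modelling question above still stands.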
>
> With best wishes,
> Ted.
>
> On 27-Mar-08 12:27:55, Tom Cohen wrote:
>>
>> Dear list,
>>   I have a dataset containing values obtained from two different
>> instruments (x and y). I want to generate 5 samples from a normal
>> distribution for each instrument, based on their means and standard
>> deviations. The problem is that the values from both instruments are
>> non-negative, so with rnorm I would get some negative values. Are
>> there any options to set the lower bound of the normal distribution
>> to 0, or can I simulate the samples in a different way to avoid the
>> negative values?
>>
>>
>>  > dat
>>     id     x         y
>> 75 101 0.134 0.1911315
>> 79 102 0.170 0.1610306
>> 76 103 0.134 0.1911315
>> 84 104 0.170 0.1610306
>> 74 105 0.134 0.1911315
>> 80 106 0.170 0.1610306
>> 77 107 0.134 0.1911315
>> 81 108 0.170 0.1610306
>> 82 109 0.170 0.1610306
>> 78 111 0.170 0.1610306
>> 83 112 0.170 0.1610306
>> 85 113 0.097 0.2777778
>> 2  201 1.032 1.5510434
>> 1  202 0.803 1.0631001
>> 5  203 1.032 1.5510434
>>
>> mu <- apply(dat[, -1], 2, mean)
>> sigma <- apply(dat[, -1], 2, sd)
>> len <- 5
>> n <- 20
>> s1 <- vector("list", len)
>> set.seed(7)
>> for (i in 1:len) {
>>   s1[[i]] <- cbind.data.frame(x = rnorm(n*i, mean = mu[1], sd = sigma[1]),
>>                               y = rnorm(n*i, mean = mu[2], sd = sigma[2]))
>> }
>>
>> Thanks for any help,
>> Tom
>>
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 27-Mar-08                                       Time: 14:00:44
> ------------------------------ XFMail ------------------------------
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

