[R] how to generate random data from an empirical distribution

Tue Jul 27 11:15:34 CEST 2010

Hi Dennis,

you should take a look at the CRAN task view for distributions
http://cran.r-project.org/web/views/Distributions.html

Besides that, our distr family of packages might be useful; see also
http://www.jstatsoft.org/v35/i10/
http://cran.r-project.org/web/packages/distrDoc/vignettes/distr.pdf

Best,
Matthias

On 27.07.2010 10:37, Dennis Murphy wrote:
> Hi:
>
> On Mon, Jul 26, 2010 at 11:36 AM, xin wei<xinwei at stat.psu.edu>  wrote:
>
>>
>> Hi, this is more a statistical question than an R question, but I do want
>> to know how to implement this in R.
>> I have 10,000 data points. Is there any way to generate an empirical
>> probability distribution from them? (The problem is that I do not know
>> which family this distribution belongs to: normal, beta?) My ultimate goal
>> is to generate an additional 20,000 data points from the empirical
>> distribution built from the existing 10,000 data points.
>> Thank you all in advance.
>>
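
For reference, a minimal base-R sketch of what the question literally asks
for. The vector x is a simulated placeholder for the 10,000 observed values
(an assumption, not the poster's data). Option (a) draws from the empirical
distribution itself by resampling with replacement; option (b) is a smoothed
variant that adds Gaussian kernel noise, which is a swapped-in technique and
not something proposed in the thread.

```r
set.seed(1)                                # only to make the sketch reproducible
x <- rgamma(10000, shape = 2, rate = 1)    # placeholder for the observed data

# (a) Sample from the empirical distribution: each observed value has
#     probability 1/n, so resampling with replacement is an exact draw.
new_emp <- sample(x, size = 20000, replace = TRUE)

# (b) Smoothed bootstrap: resample, then perturb with Gaussian noise whose
#     standard deviation is the default kernel bandwidth of density().
#     Note: perturbed values can fall outside the observed support.
bw <- density(x)$bw
new_smooth <- sample(x, size = 20000, replace = TRUE) +
  rnorm(20000, mean = 0, sd = bw)
```

Option (a) can never produce a value that was not already observed, which is
one concrete form of the concern raised in the reply that follows.
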
>
> The problem, it seems to me, is the leap of faith you're taking that the
> empirical distribution of your manifest sample will serve as a useful
> data-generating mechanism for the 20,000 future observations you want to
> take. I would think that, if you intend to take a sample of 20,000 from ANY
> distribution, you would want some confidence in the specification of said
> distribution.
>
> Even if you don't know exactly what type of population distribution you're
> dealing with, there are ways to narrow down the set of possibilities. What
> is the domain/support of the distribution? For example, the Normal is
> defined on all of R (as in the real numbers, not our favorite statistical
> programming language), whereas the lognormal, Gamma and Weibull
> distributions, among others, are defined on the nonnegative reals. The beta
> distribution is defined on [0, 1]. Therefore, knowledge of the domain is
> useful in and of itself. Is it plausible that the distribution is symmetric,
> or should it have a distinct left or right skew? (Similar comments apply to
> discrete distributions.) Is censoring or truncation a relevant concern? If
> there is a random process that well describes how the data you observe are
> generated, that will certainly narrow down the class of potential
> data-generating mechanisms/distributions.
>
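
A rough sketch of how the questions above about support and skewness could
be checked numerically in base R. The vector x is again a simulated
placeholder, and the moment-based skewness formula is just one simple
choice among several.

```r
set.seed(1)
x <- rgamma(10000, shape = 2, rate = 1)    # placeholder for the observed data

# Observed support: a minimum at or above 0 hints at a distribution on the
# nonnegative reals; a range inside [0, 1] would make a beta plausible.
support_obs <- range(x)

# Simple moment-based sample skewness: clearly positive values suggest a
# right skew, clearly negative values a left skew.
z <- (x - mean(x)) / sd(x)
skew <- mean(z^3)

support_obs
skew
```
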
> Once you've narrowed down the class of possible distributions as much as
> possible, you could look into the fitdistr() function in MASS or the
> fitdistrplus package on CRAN to test out which candidates seem plausible wrt
> your existing sample and which are not. You are not likely to be able to
> narrow it down to one family of distributions, but you should have a much
> better idea about the characteristics of the distribution that gave rise to
> your sample of 10,000 (assuming, of course, that it is a *random* sample)
> after going through this exercise, which you can apply to the generation of
> the next 20,000 observations.
>
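
The workflow described above might look roughly like this with fitdistr()
from MASS. The placeholder data, the three candidate families, and the use
of AIC as the comparison criterion are all assumptions made for
illustration; the thread does not prescribe any of them.

```r
library(MASS)

set.seed(1)
x <- rgamma(10000, shape = 2, rate = 1)    # placeholder for the observed data

# Fit a few candidate families supported on the positive reals.
# (fitdistr() may emit harmless optimizer warnings for some fits.)
fit_gamma <- fitdistr(x, "gamma")
fit_lnorm <- fitdistr(x, "lognormal")
fit_weib  <- fitdistr(x, "weibull")

# Compare the candidates; lower AIC indicates a better trade-off between
# fit and number of parameters.
aic <- c(gamma     = AIC(fit_gamma),
         lognormal = AIC(fit_lnorm),
         weibull   = AIC(fit_weib))
aic

# Generate the 20,000 new observations from one fitted candidate
# (the gamma fit here, purely as an example).
est <- fit_gamma$estimate
new_x <- rgamma(20000, shape = est["shape"], rate = est["rate"])
```

Comparing a histogram or Q-Q plot of x against each fitted candidate is a
useful complement to the AIC values before settling on a generator.
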
> OTOH, if your existing 10,000 observations were not produced by some random
> process, all bets are off.
>
> HTH,
> Dennis
>
>>
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/how-to-generate-a-random-data-from-a-empirical-distribition-tp2302716p2302716.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

-- 
Dr. Matthias Kohl
www.stamats.de