[R] Random # generator accuracy
Jim Bouldin
jrbouldin at ucdavis.edu
Fri Jul 24 01:34:50 CEST 2009
Perfectly explained Ted. One might, at first reflection, consider that
simply repeating the values 7 through 12 and sampling (w/o replacement)
from among the 18 resulting values, would be similar to just doubling the
selection probabilities for 7 through 12 and then sampling. That would
clearly not be true though.
Jim
>
> Whereas, if you replace x = c(1,2,3,4,5,6,7,8,9,110,11,12)
> with the "weighted" equivalent, doubling up 7-12 as in your
> x2 = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12),
> each of the 18 items now has the same weight as the others,
> and the "unweighted" sampling
> mean(replicate(1000000,(sample(x2, 3))))
> now gives the mean of the 18 values (7.5); whereas -- as my
> calculation showed -- the effect of the "sequential" weighting is
> to bias the result slightly downwards (in your example; generally,
> in favour of the items with lower weights), since the way weighting
> works in sample() is not equivalent to replicating each item "weight"
> times.
>
> The general problem, of sampling without replacement in such a way
> that for each item the probability that it is included in the sample
> is proportional to a pre-assigned weight ("sampling with probability
> proportional to size") is quite tricky and, for certain choices
> of weights, impossible. For a glimpse of what's inside the can of
> worms, have a look at the reference manual for the 'sampfling'
> package, in particular the function samprop():
>
> http://www.stats.bris.ac.uk/R/web/packages/sampfling/sampfling.pdf
>
> Ted.
>
> On 23-Jul-09 20:56:43, Jim Bouldin wrote:
> >
> > You are absolutely correct Ted. When no weights are applied it doesn't
> > matter if you sample with or without replacement, because the
> > probability
> > of choosing any particular value is equally distributed among all such.
> > But when they're weighted unequally that's not the case.
> >
> > It is also interesting to note that if the problem is set up slightly
> > differently, by say defining the vector x as:
> > x = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12), effectively giving
> > the
> > same probability of selection for the 12 integers as before, the same
> > problem does not arise, or at least not as severely:
> >
> >> x2
> > [1] 1 2 3 4 5 6 7 8 9 10 11 12 7 8 9 10 11 12
> >
> >> d = mean(replicate(1000000,(sample(x2, 3))));d # (1 million samples
> >> from
> > x2, of size 3; the mean should be 7.50)
> > [1] 7.499233
> >
> >> e = mean(replicate(1000000,(sample(x2, 3, replace = TRUE))));e # (1
> > million samples from x2, of size 3; with replacement this time the mean
> > should still be 7.50)
> > [1] 7.502085
> >
> >> d/e
> > [1] 0.9996198
> >
> > Jim
> >>
> >> To obtain the result you expected, you would need to explicitly
> >> specify "replace=TRUE", since the default for "replace" is FALSE.
> >> (though probably what you really intended was sampling without
> >> replacement).
> >>
> >> -- when
> >> replace=FALSE, the probability of inclusion of an element is not
> >> proportional to its weight in 'prob'.
> >>
> >> The reason is that elements with higher weights are more likely
> >> to be chosen early on. This then knocks that higher weight out
> >> of the contest, making it more likely that elements with smaller
> >> weights will be sampled subsequently. Hence the mean of the sample
> >> will be biased slightly downwards, relative to the weighted mean
> >> of the values in x.
> >>
> >> table(replicate(1000000,(sample(x, 3))))
> >> # 1 2 3 4 5 6
> >> # 250235 250743 249603 250561 249828 249777
> >> # 7 8 9 10 11 12
> >> # 249780 250478 249591 249182 249625 250597
> >>
> >> (so all nice equal frequencies)
> >>
> >> table(replicate(1000000,(sample(x, 3,prob=weights))))
> >> # 1 2 3 4 5 6
> >> # 174873 175398 174196 174445 173240 174110
> >> # 7 8 9 10 11 12
> >> # 325820 326140 325289 325098 325475 325916
> >>
> >> Note that the frequencies of the values with weight=2 are a bit
> >> less than twice the frequencies of the values with weight=1:
> >>
> >> (325820+326140+325289+325098+325475+325916)/
> >> (174873+175398+174196+174445+173240+174110)
> >> # [1]
> >>
> >>
> >> In fact this is fairly easily caluclated. The possible combinations
> >> (in order of sampling) of the two weights, with their probabilities,
> >> are:
> >> 1s 2s
> >> -------------------------------------------
> >> 1 1 1 P = 6/18 * 5/17 * 4/16 3 0
> >> 1 1 2 P = 6/18 * 5/17 * 12/16 2 1
> >> 1 2 1 P = 6/18 * 12/17 * 5/15 2 1
> >> 1 2 2 P = 6/18 * 12/17 * 10/15 1 2
> >> 2 1 1 P = 12/18 * 6/16 * 5/15 2 1
> >> 2 1 2 P = 12/18 * 6/16 * 10/15 1 2
> >> 2 2 1 P = 12/18 * 10/16 * 6/14 1 2
> >> 2 2 2 P = 12/18 * 10/16 * 8/14 0 3
> >>
> >> So the expected number of "weight=1" in the sample is
> >>
> >> 3*(6/18 * 5/17 * 4/16) + 2*(6/18 * 5/17 * 12/16) +
> >> 2*(6/18 * 12/17 * 5/15) + 1*(6/18 * 12/17 * 10/15) +
> >> 2*(12/18 * 6/16 * 5/15) + 1*(12/18 * 6/16 * 10/15) +
> >> 1*(12/18 * 10/16 * 6/14) + 0
> >> = 1.046218
> >>
> >> Hence the expected number of "weight=2" in the sample is
> >>
> >> 3 - 1.046218 = 1.953782
> >>
> >> and their ratio 1.953782/1.046218 = 1.867471
> >>
> >> Compare this with the value 1.867351 (above) obtained by simulation!
> >>
> >> Ted.
> >>
> >> --------------------------------------------------------------------
> >> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> >> Fax-to-email: +44 (0)870 094 0861
> >> Date: 23-Jul-09 Time: 21:05:07
> >> ------------------------------ XFMail ------------------------------
> >>
> >
> > Jim Bouldin, PhD
> > Research Ecologist
> > Department of Plant Sciences, UC Davis
> > Davis CA, 95616
> > 530-554-1740
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 23-Jul-09 Time: 23:00:56
> ------------------------------ XFMail ------------------------------
>
Jim Bouldin, PhD
Research Ecologist
Department of Plant Sciences, UC Davis
Davis CA, 95616
530-554-1740
More information about the R-help
mailing list