[R] Sampling

Tim Hesterberg timh at insightful.com
Wed Feb 6 19:49:24 CET 2008


>   I want to generate different samples using the
>following code:
>
>g<-sample(LETTERS[1:2], 24, replace=T)
>
>   How can I specify that I need 12 "A"s and 12 "B"s?

I introduced the concept of "sampling with minimal replacement" into the
S-PLUS version of sample to handle things like this:
	sample(LETTERS[1:2], 24, minimal = T)

This is very useful in variance reduction applications, to approximately
stratify without introducing bias.  I'd like to see this in R.
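
For the original question, base R already suffices when the counts
divide evenly: build the 24 labels and permute them.  The sketch below
also generalizes this to equal-probability sampling with minimal
replacement.  Note that sample_minimal is a hypothetical helper name
used here for illustration; it is not an existing R or S-PLUS function,
and it is not claimed to match the S-PLUS implementation.

```r
# Exactly 12 "A"s and 12 "B"s in random order: permute a fixed label vector.
labels <- rep(LETTERS[1:2], each = 12)
g <- sample(labels)            # random permutation, no replacement
table(g)

# Hypothetical helper: equal-probability sampling with minimal replacement.
# Every element appears floor(size / n) times; the leftover draws are
# taken without replacement, so no count differs from another by more than 1.
sample_minimal <- function(x, size) {
  n <- length(x)
  full  <- size %/% n                       # complete copies of every element
  extra <- size %% n                        # leftover draws
  idx <- c(rep(seq_len(n), times = full), sample.int(n, extra))
  x[idx[sample.int(length(idx))]]           # shuffle the final order
}

g2 <- sample_minimal(LETTERS[1:2], 24)
table(g2)
```

Indexing through idx rather than calling sample(x) directly avoids the
usual surprise when x is a single number.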


I'll raise a related issue - sampling with unequal probabilities,
without replacement.  R does the wrong thing, in my opinion:

> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)
values
  1   2   3 
834 574 592 

The selection probabilities are not proportional to the specified
probabilities.  
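
The R counts above are what sequential drawing predicts.  According to
?sample, when replace is FALSE the probabilities are applied
sequentially: each draw is proportional to the weights among the
remaining items.  A short sketch that enumerates both draws and sums the
resulting inclusion probabilities (the variable names are only for
illustration):

```r
# Exact inclusion probabilities for size = 2 under sequential drawing,
# where the second draw is renormalized over the remaining items.
p <- c(.5, .25, .25)
incl <- numeric(length(p))
for (i in seq_along(p)) {          # i = item taken on the first draw
  for (j in seq_along(p)) {        # j = item taken on the second draw
    if (i != j) {
      pr <- p[i] * p[j] / (1 - p[i])
      incl[c(i, j)] <- incl[c(i, j)] + pr
    }
  }
}
round(incl, 4)
# [1] 0.8333 0.5833 0.5833
```

Over 1000 samples that is about 833, 583 and 583 appearances, whereas
inclusion proportional to prob would require size*prob = 1, 0.5, 0.5.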

In contrast, in S-PLUS:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)
    1   2   3 
 1000 501 499

You can specify minimal = FALSE to get the same behavior as R:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25), minimal = F))
> table(values)
   1   2   3 
 844 592 564

There is a reason this is associated with the concept of sampling with
minimal replacement.  Consider for example:
	sample(1:4, size = 3, prob = 1:4/10)
The expected frequencies of (1,2,3,4) in each sample should equal
size*prob = c(.3,.6,.9,1.2).  That isn't possible when sampling
without replacement, because observation 4 can appear at most once.
Sampling with minimal replacement allows this; observation 4 is
included in every sample, and is included twice in 20% of the samples.
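
One way to get this behavior in R can be sketched as follows.  This is
an assumption-laden illustration, not the S-PLUS algorithm: the name
sample_minimal_prob is hypothetical, and systematic sampling over the
fractional parts is simply one technique that gives the required
inclusion probabilities.  Each element is included floor(size*prob)
times for certain, and the leftover draws are chosen so that element i
is picked with probability equal to the fractional part of its expected
count.

```r
# Hypothetical sketch: unequal-probability sampling with minimal replacement.
sample_minimal_prob <- function(x, size, prob) {
  prob <- prob / sum(prob)
  e    <- size * prob              # expected count of each element
  base <- floor(e)                 # copies included in every sample
  r    <- e - base                 # fractional parts, each < 1
  k    <- size - sum(base)         # draws still to be made (equals sum(r))
  extra <- integer(0)
  if (k > 0) {
    # Systematic sampling: lay the fractional parts end to end in random
    # order and take the units hit by k points spaced exactly 1 apart.
    # Because each r < 1, no unit can be hit twice.
    ord <- sample.int(length(x))
    pts <- runif(1) + 0:(k - 1)
    hit <- findInterval(pts, c(0, cumsum(r[ord])))
    extra <- ord[pmin(hit, length(x))]   # guard against rounding at the end
  }
  idx <- c(rep(seq_along(x), times = base), extra)
  x[idx[sample.int(length(idx))]]        # shuffle the final order
}

set.seed(1)
draws <- replicate(2000, sample_minimal_prob(1:4, size = 3, prob = 1:4/10))
tabulate(draws, nbins = 4) / 2000        # should be near .3, .6, .9, 1.2
```

The random ordering only changes which units tend to be selected
together; the per-unit inclusion probabilities do not depend on it.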

Tim Hesterberg

Disclaimer - these are my opinions, not those of my employer.

========================================================
| Tim Hesterberg       Senior Research Scientist       |
| timh at insightful.com  Insightful Corp.                |
| (206)802-2319        1700 Westlake Ave. N, Suite 500 |
| (206)283-8691 (fax)  Seattle, WA 98109-3044, U.S.A.  |
|                      www.insightful.com/Hesterberg   |
========================================================
I'll teach short courses:
Advanced Programming in S-PLUS: San Antonio TX, March 26-27, 2008.
Bootstrap Methods and Permutation Tests: San Antonio, March 28, 2008.


