[R] Permutation or Bootstrap to obtain p-value for one sample

peter dalgaard pdalgd at gmail.com
Sun Oct 9 09:52:06 CEST 2011


On Oct 8, 2011, at 16:04 , francy wrote:

> Hi, 
> 
> I am having trouble understanding how to approach a simulation:
> 
> I have a sample of n=250 from a population of N=2,000 individuals, and I
> would like to use either permutation test or bootstrap to test whether this
> particular sample is significantly different from the values of any other
> random samples of the same population. I thought I needed to take random
> samples (but I am not sure how many simulations I need to do) of n=250 from
> the N=2,000 population and maybe do a one-sample t-test to compare the mean
> score of all the simulated samples, + the one sample I am trying to prove
> that is different from any others, to the mean value of the population. But
> I don't know:
> (1) whether this one-sample t-test would be the right way to do it, and how
> to go about doing this in R
> (2) whether a permutation test or bootstrap methods are more appropriate
> 
> This is the data frame that I have, which is to be sampled:
> df<-
> i.e.
> x y
> 1 2
> 3 4
> 5 6
> 7 8
> . .
> . .
> . .
> 2,000
> 
> I have this sample from df, and would like to test whether it has extreme
> values of y. 
> sample1<-
> i.e.
> x y
> 3 4
> 7 8
> . .
> . .
> . .
> 250
> 
> For now I only have this: 
> 
> R=999 #Number of simulations, but I don't know how many...
> t.values <- numeric(R)  # creates a numeric vector with 999 elements, which
> # will hold the results of each simulation.
> for (i in 1:R) {
> sample1 <- df[sample(nrow(df), 250, replace=TRUE),] 
> 
> But I don't know how to continue the loop: do I calculate the mean for each
> simulation and compare it to the population mean? 
> Any help you could give me would be very appreciated,
> Thank you. 

The straightforward way would be a permutation test, something like this:

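msamp <- mean(sample1$y)   # mean of the observed sample
mpop <- mean(df$y)         # mean of the whole population
# means of 10000 random samples of size 250, drawn without replacement
msim <- replicate(10000, mean(sample(df$y, 250)))

# two-sided p-value: share of simulated means at least as far from mpop as msamp
mean(abs(msim - mpop) >= abs(msamp - mpop))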
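
To try this end to end without the real data, here is a minimal self-contained sketch. The data frame and the sample are made up (normal y values, x used as a row id), and the seed and R = 9999 are arbitrary choices of mine. It also swaps in the common "+1" Monte Carlo correction, which counts the observed sample as one of the simulations so the estimated p-value can never be exactly zero; that is a variant of the plain proportion used above, not the same formula.

```r
set.seed(1)                                    # for a reproducible illustration only

# Made-up stand-ins for the real data: a population of 2000 and a sample of 250
df <- data.frame(x = 1:2000, y = rnorm(2000, mean = 50, sd = 10))
sample1 <- df[sample(nrow(df), 250), ]         # drawn without replacement

R <- 9999                                      # number of simulated samples (arbitrary)
msamp <- mean(sample1$y)                       # observed sample mean
mpop  <- mean(df$y)                            # population mean

# Means of R random samples of the same size, without replacement
msim <- replicate(R, mean(sample(df$y, nrow(sample1))))

# Two-sided Monte Carlo p-value with the +1 correction
p_value <- (sum(abs(msim - mpop) >= abs(msamp - mpop)) + 1) / (R + 1)
p_value
```

Since sample1 here really is a random sample, p_value should usually not be small; with the real sample1 a small value would suggest it is unusual.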

I don't really see a reason to do bootstrapping here. You say you want to test whether your sample could be a random sample from the population, so just simulate that sampling (which should be without replacement, like your sample is). Bootstrapping might come in if you want a confidence interval for the mean difference between your sample and the rest.
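
If you did want that confidence interval, one possible approach is a simple percentile bootstrap, sketched below under stated assumptions: the data are made up, x is assumed to identify individuals uniquely so the sample can be matched back to the population, and the two groups (sample and rest) are resampled separately with replacement. It ignores the finite-population aspect, so treat it as an illustration rather than a recommended analysis.

```r
set.seed(2)                                    # for a reproducible illustration only

# Made-up stand-ins for the real data
df <- data.frame(x = 1:2000, y = rnorm(2000, mean = 50, sd = 10))
sample1 <- df[sample(nrow(df), 250), ]

# Split the population into "in the sample" and "the rest"
# (assumes x uniquely identifies each individual)
in_s  <- df$x %in% sample1$x
y_in  <- df$y[in_s]
y_out <- df$y[!in_s]

# Resample each group with replacement and record the difference in means
boot_diff <- replicate(2000,
  mean(sample(y_in, replace = TRUE)) - mean(sample(y_out, replace = TRUE)))

# 95% percentile interval for the mean difference
ci <- quantile(boot_diff, c(0.025, 0.975))
ci
```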

Instead of sampling means, you could put a full-blown t-test inside the replicate expression, like:

psim <- replicate(10000, {s<-sample(1:2000, 250); t.test(df$y[s], df$y[-s])$p.value})

and then check whether the p value for your sample is small compared to the distribution of values in psim.

That'll take quite a bit longer, though; t.test() is a more complex beast than mean(). It is not obvious that it has any benefits either, unless you specifically wanted to investigate the behavior of the t test. 
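
For completeness, the comparison of the observed p value with psim could look roughly like this. It is a sketch on made-up data; it assumes x uniquely identifies individuals so that sample1 can be matched back to df, uses fewer replications (1000) to keep the run time down, and pobs and p_mc are names I made up for illustration.

```r
set.seed(3)                                    # for a reproducible illustration only

# Made-up stand-ins for the real data
df <- data.frame(x = 1:2000, y = rnorm(2000, mean = 50, sd = 10))
sample1 <- df[sample(nrow(df), 250), ]

# Observed t-test p value: the sample versus the rest of the population
in_s <- df$x %in% sample1$x                    # assumes x is a unique id
pobs <- t.test(df$y[in_s], df$y[!in_s])$p.value

# The same statistic for R random splits of the population
R <- 1000
psim <- replicate(R, {
  s <- sample(nrow(df), nrow(sample1))
  t.test(df$y[s], df$y[-s])$p.value
})

# Share of simulated p values at least as small as the observed one
p_mc <- (sum(psim <= pobs) + 1) / (R + 1)
p_mc
```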

(All code untested. Caveat emptor.)


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


