Michael Ash mash at econs.umass.edu
Wed Jun 21 17:08:30 CEST 2017

```
I have an advanced question about bootstrapping.

There are two datasets.  In each bootstrap iteration, I would like to
sample:
  - one observation per cluster from the first dataset;
  - N observations with replacement from the second dataset.

Right now I am using dplyr::sample_n() for the first dataset, with this
sampling embedded in the statistic function that boot() (from the boot
package) calls while it resamples the second dataset and produces the
estimates.

I would prefer to do the entire sampling in the boot() part as opposed to
embedding the sample_n() statement.  The reason is so that the "original"
results will indeed be on the full data rather than on a particular sample
from the first dataset.

Any thoughts on how to implement this?  I think that it involves using
strata and weights to "fool" boot() into sampling from a concatenation of
the two datasets.  The two datasets have entirely different contents
(variables and numbers of observations).  MWE follows:

library(boot)
library(car)
library(dplyr)

(first.df  <- data.frame(cluster=gl(2,2,4),z=seq(1,2)))
(second.df  <- data.frame(y=1:2))

boot_script  <- function(X,d) {
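    ## one randomly chosen row per cluster from the first dataset
    zbar  <- mean(sample_n(group_by(first.df,cluster),1)$z)
    ## d holds the row indices that boot() drew from the second dataset
    return( c(zbar,  zbar * mean(X[d,"y"]) ))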
}

## Results based on the original data
(original.zbar  <- mean(first.df$z))
mean(original.zbar * second.df[,"y"])

## Bootstrapped results
## Problem: "Original" is itself based on a sampling
for( i in c(1:10)) {
b  <- boot(second.df, boot_script, R=100)
print(summary(b))
}
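
## One possible way to move all of the sampling into boot(), given only as
## a sketch under assumptions and not as a tested solution: pass both
## datasets as a list and use sim="parametric" with a ran.gen function that
## does the resampling, so that the "original" t0 is computed on the full
## data.  The names both, resample_both and stat_both are made up for
## this illustration.
both  <- list(first=first.df, second=second.df)

resample_both  <- function(data, mle) {
    ## one row index per cluster from the first dataset
    ## (assumes every level of cluster has at least one row)
    rows  <- seq_len(nrow(data$first))
    pick  <- vapply(split(rows, data$first$cluster),
                    function(i) i[sample.int(length(i), 1)], integer(1))
    ## N row indices with replacement from the second dataset
    n2  <- nrow(data$second)
    draw  <- sample.int(n2, n2, replace=TRUE)
    list(first=data$first[pick, , drop=FALSE],
         second=data$second[draw, , drop=FALSE])
}

stat_both  <- function(data) {
    zbar  <- mean(data$first$z)
    c(zbar, zbar * mean(data$second$y))
}

b2  <- boot(both, stat_both, R=100, sim="parametric", ran.gen=resample_both)
b2$t0   ## this is stat_both(both), i.e. based on the full data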

Thank you very much.

--
Michael Ash, Chair, Department of Economics
Professor of Economics and Public Policy
University of Massachusetts Amherst
Email mash at econs.umass.edu