[R] How to do bootstrap for the complex sample design?

Thu Nov 4 20:24:00 CET 2010

On Fri, Nov 5, 2010 at 3:51 AM, Tim Hesterberg <timhesterberg at gmail.com> wrote:
> Faye wrote:
>>Our survey is structured as follows: the area under investigation is
>>divided into 6 regions; within each region, one urban community and one
>>rural community are randomly selected; then samples are randomly drawn
>>from each selected urban and rural community.
>>
>>The problem is that in each urban/rural stratum we only have one sample.
>>In this case, how do we do the bootstrap?
>
> You are lucky that your sample size is 1.  If it were 2 you would
> probably have proceeded without realizing that the answers were wrong.
>
> Suppose you had two samples in each stratum.  If you proceed naturally,
> drawing bootstrap samples of size 2 from each stratum, this would
> underestimate the variance by a factor of 2.
>
> In general the ordinary nonparametric bootstrap estimates of variance
> are biased downward by a factor of (n-1)/n -- exactly for the mean,
> approximately for other statistics.  In multiple-sample and stratified
> situations, the bias depends on the stratum sizes.
>
> Three remedies are:
> * draw bootstrap samples of size n-1
> * "bootknife" sampling - omit one observation (a jackknife sample), then
>  draw a bootstrap sample of size n from that
> * bootstrap from a kernel density estimate, with kernel covariance equal
>  to empirical covariance (with divisor n-1) / n.
> The latter two are described in
> Hesterberg, Tim C. (2004), Unbiasing the Bootstrap-Bootknife Sampling vs. Smoothing, Proceedings of the Section on Statistics and the Environment, American Statistical Association, 2924-2930.
> http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf
>
> All three are undefined for samples of size 1.  You need to go to some
> other bootstrap, e.g. a parametric bootstrap with variability estimated
> from other data.
>
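
To make the (n-1)/n factor and the first two remedies concrete, here is a
minimal Python sketch for the sample mean only. It uses exact variance
formulas instead of simulation. The function names and the two data
values are made up for illustration; they do not come from the paper
cited above or from any package.

```python
from statistics import mean, pvariance, variance

def usual_var_of_mean(x):
    """Usual unbiased estimate s^2/n, where s^2 uses divisor n-1."""
    return variance(x) / len(x)

def bootstrap_var_of_mean(x, m=None):
    """Exact variance of the mean of m draws with replacement from x.

    m defaults to n, which is the ordinary bootstrap.
    """
    m = len(x) if m is None else m
    return pvariance(x) / m

def bootknife_var_of_mean(x):
    """Exact variance of the bootknife mean: drop one observation at
    random, then draw n with replacement from the remaining n-1."""
    n = len(x)
    leave_one_out = [x[:j] + x[j + 1:] for j in range(n)]
    # Law of total variance: average within-subsample variance
    # plus the variance of the leave-one-out means.
    within = mean([pvariance(s) / n for s in leave_one_out])
    between = pvariance([mean(s) for s in leave_one_out])
    return within + between

x = [3.0, 7.0]                            # one stratum with n = 2
print(usual_var_of_mean(x))               # 4.0
print(bootstrap_var_of_mean(x))           # 2.0, too small by (n-1)/n = 1/2
print(bootstrap_var_of_mean(x, m=1))      # 4.0, resamples of size n-1
print(bootknife_var_of_mean(x))           # 4.0
```

With a single observation, variance() raises an error and resamples of
size n-1 = 0 are empty, which is the sense in which these remedies are
undefined for n = 1.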

And the 'survey' package supplies the first option (bootstrap samples
of size n-1). It also supplies a bootstrap sample of size n that allows
finite population corrections, designed for situations with a large n
and a high sampling fraction, such as some business surveys.

With a sample size of 1 per stratum there are no design-unbiased
estimators of the standard error, so as others have said you need
external data.
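
As a rough illustration of what using external data could look like,
here is a minimal Python sketch of a parametric bootstrap for a weighted
mean over strata that each have a single estimate. Everything in it is
an assumption made for the example: the normal model, a common
between-community standard deviation (external_sd) taken from some other
source, equal stratum weights, and the made-up numbers. It is not what
the 'survey' package does.

```python
import random
from statistics import stdev

def parametric_bootstrap_se(estimates, weights, external_sd,
                            reps=2000, seed=0):
    """Bootstrap standard error of sum(w_h * y_h) when each stratum h
    has a single estimate y_h, so no within-sample variance exists.

    Each replicate redraws every stratum estimate from a normal
    distribution centred on the observed value, with a standard
    deviation supplied from outside the survey (an assumption).
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(reps):
        draw = [rng.gauss(y, external_sd) for y in estimates]
        replicates.append(sum(w * d for w, d in zip(weights, draw)))
    return stdev(replicates)

# Hypothetical example: 6 regions x (urban, rural) = 12 strata,
# one selected community per stratum, equal weights.
estimates = [52.0, 47.5, 60.1, 44.3, 55.0, 49.8,
             58.2, 41.9, 50.4, 46.7, 61.3, 43.0]
weights = [1 / 12] * 12
se = parametric_bootstrap_se(estimates, weights, external_sd=3.0)
print(se)  # should be close to 3 / sqrt(12) under these assumptions
```

The answer is only as good as the external standard deviation: the
survey data themselves contribute nothing to the variance estimate here.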

       -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland