[Rd] Using sample() to sample one value from a single value?

Thu Nov 4 15:42:52 CET 2010

On Wed, Nov 3, 2010 at 3:54 PM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:

> Hi, consider this one as an FYI, or a seed for further discussion.
>
> I am aware that many traps on sample() have been reported over the
> years.  I know that these are also documented in help("sample").  Still
> I got bitten by this while writing
>...
> All of the above makes sense when one studies the code of sample(), but
> sample() is indeed dangerous, e.g. imagine how many bootstrap
> estimates out there are quietly incorrect.

Nonparametric bootstrapping from a sample of size 1 is *always* incorrect.
If you draw a single observation from a sample of size 1, you get that
same observation back.  This implies zero sampling variability, which
is wrong.  If this single sample represents one stratum or sample in
a larger problem, this would contribute zero variability to the overall
result, again wrong.
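
To make both points concrete, here is a minimal sketch in R.  The
resample() helper follows the index-based idiom shown in the examples
of help("sample"); the value 10 and the 20 draws are made up for
illustration.

```r
# Index-based resampling: always draws from the elements of x,
# even when x has length 1.
resample <- function(x, ...) x[sample.int(length(x), ...)]

set.seed(1)
x <- 10   # a "sample" of size 1 (made-up value)

# The trap: sample() treats a single number >= 1 as the range 1:x,
# so these draws come from 1:10 rather than from the value 10 itself.
trap <- sample(x, 20, replace = TRUE)

# The safe version returns the single observation every time ...
safe <- resample(x, 20, replace = TRUE)

range(trap)
unique(safe)
var(safe)   # ... which implies zero sampling variability
```

So avoiding the sample() trap gives the "right" resamples, but the
bootstrap answer for a size-1 sample is still wrong.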

In general, the ordinary bootstrap underestimates variability in
small samples.  For a sample mean, the ordinary bootstrap corresponds
to estimating the variance with (1/n) sum((x - mean(x))^2), i.e. with
a divisor of n rather than n-1.  In stratified and multi-sample
applications the variance from each stratum or sample of size n is
similarly biased downward by a factor of (n-1)/n.
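
A quick check of the (n-1)/n factor for a sample mean, as a sketch.
The data vector and the number of replicates B are arbitrary choices
for illustration.

```r
set.seed(42)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)   # made-up data
n <- length(x)

# Variance of the sample mean implied by the ordinary bootstrap
# (divisor n in the variance estimate).
boot_var_theory <- mean((x - mean(x))^2) / n

# Usual estimate of the variance of the mean (divisor n - 1).
usual_var <- var(x) / n

# Monte Carlo version of the ordinary bootstrap.
B <- 20000
boot_means <- replicate(B, mean(x[sample.int(n, n, replace = TRUE)]))
boot_var_mc <- var(boot_means)

# Ratio of bootstrap variance to the usual estimate: (n - 1) / n.
ratio <- boot_var_theory / usual_var
c(theory = boot_var_theory, monte_carlo = boot_var_mc,
  usual = usual_var, ratio = ratio)
```

With n = 5 the bootstrap variance is only 80% of the usual estimate;
the Monte Carlo value should agree with boot_var_theory up to
simulation noise.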

Three remedies are:
* draw bootstrap samples of size n-1
* "bootknife" sampling - omit one observation (a jackknife sample), then
  draw a bootstrap sample of size n from that
* bootstrap from a kernel density estimate, with kernel covariance equal
  to empirical covariance (with divisor n-1) / n.
The latter two are described in 
Hesterberg, Tim C. (2004), Unbiasing the Bootstrap-Bootknife Sampling vs. Smoothing, Proceedings of the Section on Statistics and the Environment, American Statistical Association, 2924-2930.
http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf
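
The three remedies can be sketched for a univariate sample as
follows.  The function names are hypothetical, and the Gaussian kernel
in the third one is an assumption made for this sketch; only the
kernel variance, var(x)/n, comes from the description above.

```r
# (1) Ordinary resampling, but with n - 1 draws instead of n.
boot_n_minus_1 <- function(x) {
  n <- length(x)
  stopifnot(n >= 2)
  x[sample.int(n, n - 1, replace = TRUE)]
}

# (2) Bootknife: drop one observation at random (a jackknife sample),
#     then draw n values with replacement from the remaining n - 1.
bootknife <- function(x) {
  n <- length(x)
  stopifnot(n >= 2)
  kept <- x[-sample.int(n, 1)]
  kept[sample.int(n - 1, n, replace = TRUE)]
}

# (3) Smoothed bootstrap: resample, then add kernel noise with
#     variance var(x) / n (Gaussian kernel assumed here).
smooth_boot <- function(x) {
  n <- length(x)
  stopifnot(n >= 2)
  x[sample.int(n, n, replace = TRUE)] + rnorm(n, sd = sd(x) / sqrt(n))
}

# Usage: the variance of the resampled means should be close to
# var(x) / n for all three, rather than (n - 1) / n times that.
set.seed(7)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)   # made-up data
sapply(list(n_minus_1 = boot_n_minus_1, bootknife = bootknife,
            smoothed = smooth_boot),
       function(f) var(replicate(5000, mean(f(x)))))
var(x) / length(x)
```

The stopifnot(n >= 2) guards reflect that none of these has a
sensible definition when n is 1.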

All three are undefined for samples of size 1.  You need to go to some
other bootstrap, e.g. a parametric bootstrap with variability estimated
from other data.
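
One possible shape for such a parametric bootstrap, as a rough sketch
only.  Everything here is assumed for illustration: the normal model,
the common-variance assumption across strata, and the made-up data in
x1 and other.

```r
# Hypothetical setup: one stratum has a single observation, and
# 'other' holds data from strata assumed to share its variance.
x1 <- 10
other <- list(c(8.2, 9.1, 10.4), c(11.0, 9.7, 10.9, 12.3))

# Pooled within-stratum standard deviation from the other strata.
ss <- sum(sapply(other, function(g) sum((g - mean(g))^2)))
df <- sum(sapply(other, length) - 1)
sigma_hat <- sqrt(ss / df)

# Parametric bootstrap for the size-1 stratum: draw from a normal
# distribution centred at the observation, with the borrowed sd.
set.seed(3)
B <- 1000
boot_x1 <- rnorm(B, mean = x1, sd = sigma_hat)

sd(boot_x1)   # nonzero, unlike the nonparametric resamples
```

Whether borrowing variability this way is reasonable depends entirely
on how comparable the other data are.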

Tim Hesterberg