[R] bootstrap: stratified resampling

Tue Jun 8 18:48:53 CEST 2004

Dear All,

I was writing a small wrapper to bootstrap a classification algorithm, but if 
we generate the indices in the "usual way" as:

bootindex <- sample(index, N, replace = TRUE)

there is a non-zero probability that all the samples belong to only 
one class, thus leading to problems in the fitting (or that some classes will 
end up with only one sample, which will be a problem for quadratic 
discriminant analysis).
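
To get a feel for how often this happens, here is a minimal sketch. The class sizes (5 and 15) and the names cl, p_missing_a and missing_a are made up for illustration; only index, N and bootindex come from the line above. With N draws with replacement, the probability that class "a" is absent is (1 - n_a/N)^N, and a quick simulation should roughly agree:

```r
# Hypothetical data: two classes with 5 and 15 observations (an assumption).
cl <- factor(rep(c("a", "b"), times = c(5, 15)))
N <- length(cl)
index <- seq_len(N)

# Exact probability that an ordinary bootstrap sample has no "a" at all:
# every one of the N draws must land in class "b".
p_missing_a <- (1 - mean(cl == "a"))^N   # (15/20)^20, about 0.003

# Monte Carlo check using the "usual way" of generating indices.
set.seed(1)
B <- 10000
missing_a <- replicate(B, {
  bootindex <- sample(index, N, replace = TRUE)
  !any(cl[bootindex] == "a")
})

print(c(exact = p_missing_a, simulated = mean(missing_a)))
```

The probability is small here, but it grows quickly as the minority class shrinks, and the probability of ending up with only one or two distinct observations in a class is larger still.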

I thought this situation should be frequent enough to be mentioned in the 
literature, but I have found almost no mention in the references I have 
available, except for Hirst (see below). If I've reread correctly, this issue 
is not mentioned in Efron & Tibshirani (1997; the .632+ paper), or in Efron 
and Gong (the TAS "leisure look" paper), or the Efron & Tibshirani 1993 
bootstrap book, or Chernick's "Bootstrap methods" book. I've only seen some 
side mentions in Ripley's Pattern recognition (when talking about stratified 
cross-validation), and Davison & Hinkley's bootstrap book when, on p. 304, 
they refer to some subsets having singular design matrices, and thus 
requiring stratification on covars. McLachlan (in his discriminant analysis 
book), on p. 347, differentiates between mixture sampling and separate 
sampling, but I cannot find any mention of what to do when, under mixture sampling, we 
end up with all samples in only one group.

Only Hirst (1996, Technometrics, 38 (4): 389--399) says that each bootstrap 
sample should include at least one observation for each group, and at least 
enough different observations from each group to allow estimation of the 
covariance matrix (he is referring to discriminant analysis), and thus he 
uses essentially stratified bootstrap samples.
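
A minimal sketch of what such stratified bootstrap indices could look like in plain R follows. This is my reading of "stratified", not Hirst's actual code; stratified_bootindex and the example class sizes are hypothetical. It resamples with replacement within each class, so class sizes stay fixed, but it does not by itself check that each class keeps enough distinct observations:

```r
# Hypothetical helper: draw bootstrap indices separately within each class.
stratified_bootindex <- function(cl) {
  idx_by_class <- split(seq_along(cl), cl)
  unlist(lapply(idx_by_class, function(idx) {
    # Index via sample.int() so that a class with a single observation
    # is handled correctly (sample(idx) would misbehave when length(idx) == 1).
    idx[sample.int(length(idx), replace = TRUE)]
  }), use.names = FALSE)
}

# Made-up example: 5 observations in class "a", 15 in class "b".
cl <- factor(rep(c("a", "b"), times = c(5, 15)))
set.seed(2)
bootindex <- stratified_bootindex(cl)

# Every class keeps its original size in the bootstrap sample.
print(table(cl[bootindex]))

# Number of distinct observations per class; this can still be small.
print(tapply(bootindex, cl[bootindex], function(i) length(unique(i))))
```

If a minimum number of distinct observations per class is needed (as for estimating a covariance matrix), one option would be to redraw until the second table meets that minimum, at the cost of changing the resampling distribution.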

Interestingly, the "boot" function (boot library) says "For nonparametric 
multi-sample problems stratified resampling is used." Likewise, the 
documentation of predab.resample (Design library) says "group: a grouping variable used to 
stratify the sample upon bootstrapping. This allows one to handle k-sample 
problems, (...)".
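
For reference, a small sketch of how the strata argument of boot can be used. The data frame d and the statistic class_counts are made up for illustration; the statistic just returns the class counts of each resample, so one can see that they never change under stratified resampling:

```r
library(boot)

# Made-up data: 5 observations in class "a", 15 in class "b".
set.seed(3)
d <- data.frame(x  = rnorm(20),
                cl = factor(rep(c("a", "b"), times = c(5, 15))))

# Hypothetical statistic: class counts in the resampled data.
# boot() calls it with the data and a vector of resampled row indices.
class_counts <- function(data, i) as.vector(table(data$cl[i]))

# strata = d$cl makes boot() resample within each class separately.
b <- boot(d, class_counts, R = 200, strata = d$cl)

# All 200 replicates have exactly 5 "a" and 15 "b".
print(unique(b$t))
```

In a real wrapper the statistic would fit the classifier on data[i, ] and return an error estimate instead of the counts.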

That the authors of boot and Design are using stratified resampling indicates 
to me that this might be the obvious, unproblematic way to go, but I 
understood that stratified resampling was OK only when that was the sampling 
scheme that generated the data.

What am I missing?

Thanks,

R.

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)