[R] bootstrap: stratified resampling
Ramon Diaz-Uriarte
rdiaz at cnio.es
Tue Jun 8 18:48:53 CEST 2004
Dear All,
I was writing a small wrapper to bootstrap a classification algorithm, but if
we generate the indices in the "usual way" as:
bootindex <- sample(index, N, replace = TRUE)
there is a non-zero probability that all the samples belong to only
one class, thus leading to problems in the fitting (or that some classes will
end up with only one sample, which will be a problem for quadratic
discriminant analysis).
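As a rough illustration of how often this can happen, the sketch below simulates plain bootstrap samples for a small, unbalanced two-class problem. The class labels `y`, the class sizes (3 and 17), and the number of replicates `B` are assumptions made up for the example; they do not come from any real data set.

```r
# Hypothetical class labels: 3 observations in class "a", 17 in class "b".
y <- factor(rep(c("a", "b"), times = c(3, 17)))
N <- length(y)
index <- seq_len(N)

set.seed(1)  # only so the illustration is reproducible
B <- 2000    # number of bootstrap samples (arbitrary choice)

# For each plain bootstrap sample, record the smallest class count.
# Because y is a factor, table() also reports classes with zero count.
min_count <- replicate(B, {
  bootindex <- sample(index, N, replace = TRUE)
  min(table(y[bootindex]))
})

# Proportion of samples in which some class is missing entirely,
# and proportion in which some class has fewer than two observations.
prop_missing <- mean(min_count == 0)
prop_lt2 <- mean(min_count < 2)

# Exact probability that class "a" never appears in a sample of size N.
p_missing_exact <- (17 / 20)^N
```

Under these assumed class sizes, the chance that the small class is absent altogether is (17/20)^20, roughly 4%, and the chance that it has fewer than two observations is necessarily at least as large.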
I thought this situation would be frequent enough to be mentioned in the
literature, but I have found almost no mention in the references I have
available, except for Hirst (see below). If I've reread correctly, this issue
is not mentioned in Efron & Tibshirani (1997; the .632+ paper), or in Efron
and Gong (the TAS "leisure look" paper), or the Efron & Tibshirani 1993
bootstrap book, or Chernick's "Bootstrap methods" book. I've only seen some
side mentions in Ripley's Pattern recognition (when talking about stratified
cross-validation), and Davison & Hinkley's bootstrap book when, on p. 304,
they refer to some subsets having singular design matrices, and thus
requiring stratification on covars. McLachlan (in his discriminant analysis
book), on p. 347, differentiates between mixture sampling and separate
sampling, but I cannot find any mention of what to do when, under mixture
sampling, we end up with all samples in only one group.
Only Hirst (1996, Technometrics, 38 (4): 389--399) says that each bootstrap
sample should include at least one observation for each group, and at least
enough different observations from each group to allow estimation of the
covariance matrix (he is referring to discriminant analysis), and thus he
uses essentially stratified bootstrap samples.
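A minimal sketch of what generating stratified bootstrap indices could look like is given below. The helper `stratified_bootindex` and the example labels are hypothetical names introduced for illustration; this is not Hirst's procedure as published, only resampling with replacement within each class.

```r
# Hypothetical helper: resample indices with replacement within each class,
# so every class keeps its original number of observations.
stratified_bootindex <- function(y) {
  y <- droplevels(as.factor(y))
  idx_by_class <- split(seq_along(y), y)
  resampled <- lapply(idx_by_class, function(idx) {
    # Index via sample.int() to avoid sample()'s special behaviour
    # when it is handed a single number.
    idx[sample.int(length(idx), length(idx), replace = TRUE)]
  })
  unlist(resampled, use.names = FALSE)
}

# Hypothetical labels: 3 observations in class "a", 17 in class "b".
y <- factor(rep(c("a", "b"), times = c(3, 17)))
set.seed(1)
bootindex <- stratified_bootindex(y)
table(y[bootindex])  # class counts match those of the original sample
```

Note that this only guarantees the class sizes. A class could still be resampled as repeated copies of a single observation, so the second requirement (enough different observations per group to estimate the covariance matrix) would need an additional check, for example rejecting and redrawing such samples.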
Interestingly, the "boot" function (boot library) says "For nonparametric
multi-sample problems stratified resampling is used." Likewise,
predab.resample (Design library) says "group: a grouping variable used to
stratify the sample upon bootstrapping. This allows one to handle k-sample
problems, (...)".
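For comparison, here is a hedged sketch of stratified resampling through the `strata` argument of `boot()`. The data frame `d` and the statistic `smallest_class` are made up for the example; the statistic simply reports the size of the smallest class in each resample, so the effect of stratification is visible directly.

```r
library(boot)

set.seed(1)
# Hypothetical data: one covariate and an unbalanced two-class label.
d <- data.frame(x = rnorm(20),
                y = factor(rep(c("a", "b"), times = c(3, 17))))

# Statistic in the form boot() expects: (data, indices) -> number.
# Here it returns the size of the smallest class in the resample.
smallest_class <- function(data, i) min(table(data$y[i]))

# Ordinary bootstrap: the smallest class can shrink, even to zero.
b_plain <- boot(d, smallest_class, R = 200)

# Stratified bootstrap: resampling is done within each level of d$y.
b_strat <- boot(d, smallest_class, R = 200, strata = d$y)

range(b_plain$t)  # varies from resample to resample
range(b_strat$t)  # constant, since class sizes are preserved
```

With `strata` supplied, every resample keeps 3 observations in the small class, whereas the ordinary bootstrap lets that count vary.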
That the authors of boot and Design are using stratified resampling indicates
to me that this might be the obvious, unproblematic way to go, but I
understood that stratified resampling was OK only when that was the sampling
scheme that generated the data.
What am I missing?
Thanks,
R.
--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)