[R] Bootstrap tree selection in rpart
Terry Therneau
therneau at mayo.edu
Wed Sep 12 15:08:57 CEST 2007
Fiona Callaghan asked about using the bootstrap instead of cross-validation in
the tree pruning step.
It turns out that cross-validation works better than the bootstrap for trees.
The issue is a subtle one. The bootstrap can be thought of as two steps.
1. Deduction: Evaluate the behavior of some statistic "zed" under repeated
sampling from the discrete distribution F-hat, i.e., the original data. This
gives a direct evaluation of how zed behaves under F-hat.
2. Induction: Assume that (behavior of zed under sampling from F) = (behavior
under sampling from F-hat).
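As a rough illustration of the two steps, here is a minimal Python sketch. It is not rpart code: the median stands in for "zed", and the function name, the simulated data, and the number of replicates are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_distribution(data, zed, n_boot=1000, seed=0):
    """Step 1 (deduction): draw repeated samples of size n from F-hat,
    i.e. from the observed data with replacement, and record zed each time."""
    rng = random.Random(seed)
    n = len(data)
    return [zed(rng.choices(data, k=n)) for _ in range(n_boot)]

# Hypothetical data standing in for a sample from the unknown F.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(50)]

replicates = bootstrap_distribution(data, statistics.median)

# Step 2 (induction) is not a computation but an assumption: that the
# spread of these replicates also describes how zed would vary under
# fresh samples from the true F.
boot_se = statistics.stdev(replicates)
```

Step 1 is exact up to simulation error; everything that can go wrong sits in step 2.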
It turns out that trees behave differently under discrete distributions than
they do under continuous ones, so step 2 fails. Essentially, there are fewer
places to split in the discrete case, tree creation is less noisy, and the
bootstrap gives an overoptimistic view. I remember Brad Efron giving a talk on
this long ago (I was still a student!), so the details are fuzzy; I think that
he solved it by sampling from a smoothed version of the empirical CDF.
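The "fewer places to split" point can be seen without fitting any tree. The Python sketch below counts the distinct cut points available on a single numeric predictor in the original sample, in a plain bootstrap sample, and in a smoothed bootstrap sample. The smoothing shown is a generic Gaussian-kernel jitter with an arbitrary bandwidth; it is an assumption for illustration and not necessarily the method from Efron's talk.

```python
import random
import statistics

def candidate_splits(x):
    """Number of distinct cut points a tree could place on one numeric predictor."""
    return len(set(x)) - 1

def plain_bootstrap(x, rng):
    # Sample n values from F-hat: ties are guaranteed to appear.
    return rng.choices(x, k=len(x))

def smoothed_bootstrap(x, rng, bandwidth):
    # Resample, then add Gaussian noise to each draw. This is the same as
    # sampling from a kernel-smoothed version of the empirical distribution.
    return [v + rng.gauss(0.0, bandwidth) for v in rng.choices(x, k=len(x))]

rng = random.Random(42)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]  # hypothetical continuous predictor
h = 0.3 * statistics.stdev(x)                  # assumed bandwidth, chosen arbitrarily

n_orig = candidate_splits(x)
n_plain = candidate_splits(plain_bootstrap(x, rng))
n_smooth = candidate_splits(smoothed_bootstrap(x, rng, h))
```

On average only about 63% of the original points appear in a plain bootstrap sample, so n_plain falls well below n_orig, while the smoothed sample has essentially no ties and keeps about as many cut points as the original data.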
Terry Therneau