[R] Bootstrap tree selection in rpart

Wed Sep 12 15:08:57 CEST 2007

Fiona Callaghan asked about using the bootstrap instead of cross-validation in
the tree pruning step.
   It turns out that cross-validation works better than the bootstrap for trees.
The issue is a subtle one.  The bootstrap can be thought of as two steps.

1.  Deduction: Evaluate the behavior of some statistic "zed" under repeated
sampling from the discrete distribution F-hat, i.e., the original data.  This
gives a direct evaluation of how zed behaves under F-hat.

2. Induction: Assume that (behavior of zed under sampling from F, the true
underlying distribution) = (behavior under sampling from F-hat).
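
A minimal sketch of step 1, written in Python with only the standard library
rather than R so that it is self-contained.  The function name
bootstrap_distribution, the toy data, and the choice of the median as "zed"
are all hypothetical and only for illustration; none of this is rpart code.

```python
import random
import statistics

def bootstrap_distribution(data, statistic, n_boot=1000, seed=0):
    """Step 1 (deduction): draw resamples of size n from F-hat, i.e. from
    the observed data with replacement, and record the statistic each time."""
    rng = random.Random(seed)
    n = len(data)
    return [
        statistic([rng.choice(data) for _ in range(n)])
        for _ in range(n_boot)
    ]

# Toy data, made up for this example
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]

# Behavior of the median ("zed") under repeated sampling from F-hat
reps = bootstrap_distribution(data, statistics.median)
boot_se = statistics.stdev(reps)  # bootstrap estimate of the standard error
print(boot_se)
```

Step 2 is not something the code can check: treating boot_se as the standard
error under F is exactly the assumption in question.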

  It turns out that trees behave differently under discrete distributions than
they do under continuous ones, so step 2 fails.  Essentially, a resample from
F-hat repeats some observations, so there are fewer places to split in the
discrete case, tree creation is less noisy, and the
bootstrap gives an overoptimistic view.  I remember Brad Efron giving a talk on
this long ago (I was still a student!), so the details are fuzzy; I think that
he solved it by sampling from a smoothed version of the empirical CDF.
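
The "fewer places to split" point can be illustrated with a small sketch, again
in Python with only the standard library.  It counts the candidate split points
for a single continuous predictor in the original sample, in an ordinary
bootstrap resample, and in a resample from a smoothed F-hat.  The Gaussian
kernel and the bandwidth of 0.2 standard deviations are assumptions made for
the example, not necessarily the smoothing used in the talk mentioned above,
and the helper names are hypothetical.

```python
import random
import statistics

def n_split_points(x):
    """Candidate split points for one continuous predictor: one between
    each pair of adjacent distinct values."""
    return len(set(x)) - 1

def plain_bootstrap(x, rng):
    # Sample n values from the data with replacement (sampling from F-hat)
    return [rng.choice(x) for _ in x]

def smoothed_bootstrap(x, rng, bandwidth):
    # Resample, then add Gaussian noise: a draw from a kernel-smoothed F-hat
    return [rng.choice(x) + rng.gauss(0.0, bandwidth) for _ in x]

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]  # simulated continuous predictor
h = 0.2 * statistics.stdev(x)                  # assumed bandwidth, illustration only

plain = plain_bootstrap(x, rng)
smooth = smoothed_bootstrap(x, rng, h)

print(n_split_points(x), n_split_points(plain), n_split_points(smooth))
```

On average an ordinary resample of size n keeps only about 1 - 1/e, or 63%, of
the distinct values, so the tree sees noticeably fewer possible splits than it
would in a fresh sample from a continuous F; the smoothed resample restores
them.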

   Terry Therneau