[R] Bootstrap tree selection in rpart

Thu Sep 13 16:58:16 CEST 2007

Hi there, 

Rather than cross-validating or bootstrapping to prune a single tree, you could use a random forest instead. See the overview at http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm 

There is an R package for this called randomForest (load it with library(randomForest)). I have found it to be an excellent method that produces better forecasts, both in-bag and out-of-bag, than a single tree. It also still lets you identify the most important variables. It handles both continuous and categorical variables, and the outcome can be either: a numeric response gives a regression forest, a factor response a classification forest. 
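
A minimal sketch of what that might look like, assuming the randomForest package is installed from CRAN. The built-in airquality data, the seed and ntree = 500 are only illustrative choices, not anything taken from the original question; the continuous response (Ozone) is picked to match the continuous-outcome case.

```r
library(randomForest)

# Illustrative data only: built-in airquality, with incomplete rows dropped
aq <- na.omit(airquality)

set.seed(1)  # arbitrary seed, for reproducibility
# A numeric response makes randomForest() fit a regression forest;
# a factor response would give a classification forest instead.
fit <- randomForest(Ozone ~ ., data = aq, ntree = 500, importance = TRUE)

print(fit)                           # includes the out-of-bag (OOB) error estimate
oob   <- predict(fit)                # with no newdata, predict() returns OOB predictions
inbag <- predict(fit, newdata = aq)  # predictions for the training rows themselves

# Compare the two mean squared errors; the in-sample one is optimistic
mean((aq$Ozone - oob)^2)
mean((aq$Ozone - inbag)^2)

importance(fit)                      # variable importance measures
```

The OOB error plays roughly the role that cross-validation plays for a single pruned tree, so no separate pruning step is needed.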

Regards

Wayne

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Fiona Callaghan
Sent: 13 September 2007 15:31
To: Terry Therneau
Cc: r-help at stat.math.ethz.ch; fmc2 at pitt.edu
Subject: Re: [R] Bootstrap tree selection in rpart

Thanks very much for replying -- just one final question: does this hold
when the outcome is continuous (and not discrete), e.g., when instead of
a multinomial outcome we have a continuous outcome like residuals?

Thanks again
Fiona
> Fiona Callaghan asked about using the bootstrap instead of cross-validation
> in the tree pruning step.
>    It turns out that cross-validation works better than the bootstrap for
> trees.  The issue is a subtle one.  The bootstrap can be thought of as 2
> steps.
>
> 1.  Deduction: Evaluate the behavior of some statistic "zed" under repeated
> sampling from the discrete distribution F-hat, i.e., the original data.
> This gives a direct evaluation of how zed behaves under F-hat.
>
> 2.  Induction: Assume that (behavior of zed under sampling from F) =
> (behavior under sampling from F-hat).
>
>   It turns out that trees behave differently under discrete distributions
> than they do under continuous ones, so step 2 fails.  Essentially, there are
> fewer places to split in the discrete case, tree creation is less noisy, and
> the bootstrap gives an overoptimistic view.  I remember Brad Efron giving a
> talk on this long ago (I was still a student!), so the details are fuzzy; I
> think that he solved it by sampling from a smoothed version of the empirical
> CDF.
>
>    Terry Therneau
>

-- 
Fiona Callaghan, MA MS
A432 Crabtree Hall
Department of Biostatistics
Graduate School of Public Health
University of Pittsburgh
Phone 412 624 3063
