[R] comparing random forests and classification trees

Thu Sep 17 17:16:40 CEST 2009

The validate.rpart function in the rms package will handle the rpart
part of this.  It makes sure that the tree is re-built from scratch for
each re-sample.  It estimates MSE and Somers' Dxy (which, for a binary
response, equals 2 * (ROC area - 0.5)).
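
A minimal sketch of such a call, assuming the rms and rpart packages are installed. The simulated variables (x1, x2, x3, y) are purely illustrative and not from this thread. The tree has to be fit with model=TRUE so that it can be re-grown in each cross-validation fold.

```r
library(rpart)
library(rms)

# Illustrative simulated data (an assumption, not data from this thread)
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
y  <- 1 * (x1 + x2 + rnorm(n) > 1)

# model = TRUE stores the model frame, which validate() needs to refit
f <- rpart(y ~ x1 + x2 + x3, model = TRUE)

# 10-fold cross-validation; pr = FALSE suppresses per-step printing
v <- validate(f, xval = 10, pr = FALSE)
v
```

The printed result lists apparent and cross-validated Dxy and MSE for a sequence of tree sizes.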

Frank

jamesmcc wrote:
> Greetings tree and forest coders-
> 
> I'm interested in comparing random forests and regression tree/bagging tree
> models. I'd like to propose a basis for doing this, get feedback, and
> document this here. I kept it in this thread since that makes sense.
> 
> In this case I think it's appropriate to compare the R^2 values as one basic
> measure. I'm actually going to compare mean error (ME), mean absolute error
> (MAE), root mean squared error (RMSE) as well. This means that I need
> estimates from each approach so that I can form residuals. **As I see it,
> the important details are in how to set up the models so that I have
> comparable estimates, particularly in how the trees/forests are trained and
> evaluated.**
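
The four measures named above can be computed from any pair of observed and predicted vectors. The sketch below is one possible helper; the function name error_summary and the sign convention (predicted minus observed) are assumptions, and R^2 is taken here as 1 - SSE/SST.

```r
# Hypothetical helper: summary measures for observed vs. predicted values
error_summary <- function(obs, pred) {
  res <- pred - obs                      # assumed sign convention
  c(ME   = mean(res),                    # mean error (bias)
    MAE  = mean(abs(res)),               # mean absolute error
    RMSE = sqrt(mean(res^2)),            # root mean squared error
    R2   = 1 - sum(res^2) / sum((obs - mean(obs))^2))
}

error_summary(obs = c(1, 2, 3), pred = c(2, 3, 4))
```

Applying the same function to the held-out predictions from each method keeps the comparison on a common footing.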
> 
> For regression/bagging trees, the typical approach for my application is 100
> runs of 10-fold CV. In each run all the values are estimated in an
> out-of-the-bag sense; each "fold" is estimated while it is withheld from
> fitting, thus fit is not inflated. The estimates are then averaged over the
> 100 runs at each point to get an average simulation and this is used to
> calculate residuals and the measures mentioned above. Somewhat more
> specifically, the steps are: I fit a model, I prune it via inspection, I
> loop 100 times on xpred.rpart(model, xval=10, cp=<cp at bottom of cptable
> from pruned fit>) to generate the 100 runs (bagging is thus performed while
> holding the cp criterion fixed?), I average these pointwise, and I calculate
> the desired stats/quantities for comparison to other models.
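
The steps described above might look roughly like the following sketch. The data frame d, the variables x1, x2 and y, and the names cp_fixed and n_runs are illustrative assumptions. Taking the last row of the cptable stands in for the "prune via inspection" step. Each call to xpred.rpart with an integer xval draws a fresh random partition into folds, so this is repeated cross-validation at a fixed cp rather than bagging in the bootstrap sense.

```r
library(rpart)

# Illustrative simulated regression data (an assumption)
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * d$x1 + sin(6 * d$x2) + rnorm(n, sd = 0.3)

fit <- rpart(y ~ x1 + x2, data = d, method = "anova", model = TRUE)

# Stand-in for the cp chosen by inspection: bottom row of the cptable
cp_fixed <- fit$cptable[nrow(fit$cptable), "CP"]

# 100 runs of 10-fold CV; each column holds one run's held-out predictions
n_runs  <- 100
cv_pred <- replicate(n_runs,
                     as.vector(xpred.rpart(fit, xval = 10, cp = cp_fixed)))

# Pointwise average over runs, then residuals for the summary measures
avg_pred <- rowMeans(cv_pred)
resid_cv <- d$y - avg_pred
sqrt(mean(resid_cv^2))   # RMSE of the averaged held-out predictions
```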
> 
> For randomForest, I would want to fit the model in a similar way, i.e. 100
> runs of 10-fold CV. I think the 10-fold part is clear, the 100 runs, maybe
> less so. To get 10-fold OOB estimates, I set replace=FALSE,
> sampsize=.9*nrow(x). Then I get a randomForest with $predicted being the
> average OOB estimates over all trees for which each point was OOB. I would
> assume that each tree is constructed with a different 10-fold partitioning
> of the data set. Thus the number of runs is really more like the number of
> trees constructed. If I wanted to be really thorough, I could fit 100 random
> forests and get the $predicted for each and then average these pointwise.
> But that seems like overkill; isn't the lesson of plot.randomForest that,
> as the number of trees goes up, the error converges to some limit (from
> what I've seen)?
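
A sketch of the random forest setup described above, assuming the randomForest package is installed; the simulated data and the choice ntree = 500 are illustrative assumptions. Note that each tree draws its own random 90% subsample without replacement, so the held-out 10% sets overlap across trees rather than forming one shared 10-fold partition.

```r
library(randomForest)

# Illustrative simulated regression data (an assumption)
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 2 * d$x1 + sin(6 * d$x2) + rnorm(n, sd = 0.3)

# Each tree is grown on 90% of the rows, sampled without replacement
rf <- randomForest(y ~ x1 + x2, data = d,
                   ntree    = 500,
                   replace  = FALSE,
                   sampsize = floor(0.9 * nrow(d)))

# OOB prediction per point: average over the trees that did not see it
oob_pred <- rf$predicted
resid_rf <- d$y - oob_pred

summary(rf$oob.times)   # how many trees each point was held out of
plot(rf)                # OOB error against the number of trees
```

With a 10% hold-out, each point is out-of-bag for only about one tree in ten, so checking rf$oob.times is one way to judge whether ntree is large enough for stable averages.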
> 
> Thus, my primary concern is the amount of data used for training and
> cross-validating the model in an out-of-bag sense: can I meaningfully
> compare 10-fold OOB estimates from xpred.rpart to a random forest fit using
> 90% of the data as sampsize?
> 
> Of secondary concern is the number of bagging trees versus the number of
> trees in the random forest. As long as the average estimate error is nearing
> some limit with the number of bagging trees I'm using, I think this is all
> that matters. So this is more of a methodological difference to be retained,
> similar to differences in pruning under bagging and random forests, though I
> should probably specify the node sizes to be similar for each.
> 
> Am I overlooking anything of grave consequence?
> 
> Any and all thoughts are welcome. If you are aware of any comparisons of
> rpart and randomForest in the literature for any field (for regression) of
> which I am ignorant, I would appreciate the tip. I have read over "Newer
> Classification and Regression Tree Techniques: Bagging and Random Forests
> for Ecological Prediction" by Prasad, Iverson, and Liaw. I may have missed
> it, but I did not see discussion of maintaining consistency in the way the
> models were trained, though it is a very nice paper overall and contained
> many interesting approaches and points. 
> 
> Thanks in advance, 
> 
> James
> 

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University