[R] comparing random forests and classification trees

Jim Porzak jporzak at gmail.com
Wed Jan 31 15:55:53 CET 2007


Amy, et al,

I agree with you and the group that comparing test set classification
errors between the two methods is the way to go.

On interpretation, I find the partial dependence plots from
randomForest useful - especially when talking to clients about what
the forest "means". See slides 32 to 38 in my recent DMA presentation
(link below) for some examples. (When looking at the plots for
continuous variables, it is important to pay attention to the decile
rug plot on the x-axis so as not to be distracted by the edges, which
apply to only a small part of the population.)
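
For example, a minimal sketch of such a plot (using the built-in iris
data purely for illustration):

  library(randomForest)
  set.seed(42)
  rf <- randomForest(Species ~ ., data = iris, ntree = 500)
  ## partial dependence of the "versicolor" class on Petal.Width;
  ## rug = TRUE draws the data deciles along the x-axis
  partialPlot(rf, pred.data = iris, x.var = "Petal.Width",
              which.class = "versicolor", rug = TRUE)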

I would argue that, except for simple "text book" examples, a full
classification tree is not all that easy to interpret. Sure, anyone
can walk through each branch, but the overall meaning gets lost in
the trees.

http://loyaltymatrix.com/JimPorzak_RFwithR_DMAAC_Jan07_webinar.pdf


On 1/30/07, Darin A. England <england at cs.umn.edu> wrote:
> Amy,
>
> I have also had this issue with randomForest, that is, you lose the
> ability to explain the classifier in a simple way to
> non-specialists (everyone can understand a single decision tree).
> As far as comparing the accuracy of the two, I think you are correct
> to compare them by the actual vs. predicted tables. randomForest
> reports this as the confusion matrix, and it also reports the
> out-of-bag error, which I think is what you are referring to. I
> would not compare the rf out-of-bag error with the rpart relative
> error (or the cross-validated error, if you are doing cross
> validation).
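>
> A rough sketch of what I mean, assuming 'rf' is a randomForest fit
> and 'rp' an rpart fit on the same training data, with a held-out
> set 'test' (names are just placeholders):
>
>   rf$confusion      # OOB confusion matrix from randomForest
>   printcp(rp)       # rpart rel error / xerror -- not comparable to OOB
>   ## instead, compare both on the same held-out test set:
>   table(test$y, predict(rf, newdata = test))
>   table(test$y, predict(rp, newdata = test, type = "class"))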
>
> So, for what it's worth, I think you are correct. Also, do you know
> about ctree in the "party" package? If you want to retain the
> explanatory power of a single tree and still have an accurate
> classifier, I have found ctree to work quite well.
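>
> A minimal sketch (using the built-in iris data just to illustrate):
>
>   library(party)
>   ## conditional inference tree -- still a single, plottable tree
>   ct <- ctree(Species ~ ., data = iris)
>   plot(ct)
>   table(iris$Species, predict(ct))  # resubstitution confusion matrix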
>
> HTH,
>
> Darin
>
> On Mon, Jan 29, 2007 at 11:34:51AM +1100, Amy Koch wrote:
> > Hi,
> >
> > I have done an analysis using 'rpart' to construct a classification
> > tree. I want to retain the output in tree form so that it is easily
> > interpretable. However, I want to compare the 'accuracy' of the tree
> > to a random forest, to estimate how much predictive ability is lost
> > by using one simple tree. My understanding is that the error
> > automatically displayed by the two functions is calculated
> > differently, so it is incorrect to use it as a comparison. Instead I
> > have produced a table for both analyses comparing the observed and
> > predicted responses.
> >
> > E.g. table(data$dependent, predict(model, type = "class"))
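> >
> > From such a table the misclassification rate can then be computed,
> > e.g.:
> >
> >   tab <- table(data$dependent, predict(model, type = "class"))
> >   1 - sum(diag(tab)) / sum(tab)   # misclassification rate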
> >
> > I am looking for confirmation that (a) it is incorrect to compare the error
> > estimates for the two techniques and (b) that comparing the
> > misclassification rates is an appropriate method for comparing the two
> > techniques.
> >
> > Thanks
> >
> > Amy
> >
> >
> >
> >
> >
> > Amelia Koch
> >
> > University of Tasmania
> >
> > School of Geography and Environmental Studies
> >
> > Private Bag 78 Hobart
> >
> > Tasmania, Australia 7001
> >
> > Ph: +61 3 6226 7454
> >
> > ajkoch at utas.edu.au
> >
> >
> >
> >


-- 
HTH,
Jim Porzak
Loyalty Matrix Inc.
San Francisco, CA


