[R] rpart vs. randomForest

Mon Apr 14 19:37:21 CEST 2003

One of these days I promise to write a package vignette...

As Martin said, RF uses many trees (500 by default).  The "forest" component
of the randomForest object contains all the trees, but not in an easily
readable form (because I don't see much use in "looking" at the trees except
for debugging purposes).  If you really want to see what a tree looks like,
grow just one tree and look at the "forest" component.  Here is some
explanation:

For each tree: 
o  "nrnodes" is the maximum number of nodes a tree can have.  

o  "ndbigtree" is a vector of length ntree containing the total number of
nodes in the trees.

o  "nodestatus" is an nrnodes by ntree matrix of indicators: -1 if the node
is terminal.

o  "treemap" is a 3-D array, containing a two-column matrix for each tree.  The
first column indicates which node is the "left descendant" and the second
column the "right descendant".  Both are 0 if the node is terminal.

o  "bestvar" is an nrnodes by ntree matrix that indicates, for each node,
which variable is used to split that node.  0 for terminal nodes.

o  "xbestsplit" is the same as "bestvar", except it tells where to split.
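
To make the layout above concrete, here is a minimal sketch in base R that
walks a single tree stored in this form.  The arrays are filled in by hand
for a toy three-node tree (they are not taken from a real fit).  The helper
name walk_tree, the value 1 used for non-terminal nodes in "nodestatus", and
the rule "go left when the value is <= the split point" are assumptions made
for illustration; check them against your version of the package.

```r
# Toy tree with 3 nodes: node 1 is the root, nodes 2 and 3 are terminal.
# All values are made up for illustration; they do not come from a real fit.
treemap <- matrix(c(2, 3,    # node 1: left descendant 2, right descendant 3
                    0, 0,    # node 2: terminal, so both entries are 0
                    0, 0),   # node 3: terminal
                  ncol = 2, byrow = TRUE)
bestvar    <- c(1, 0, 0)     # root splits on variable 1; 0 for terminal nodes
xbestsplit <- c(2.5, 0, 0)   # where the root splits
nodestatus <- c(1, -1, -1)   # -1 marks a terminal node (1 is an assumed filler)

# Assumed convention: go left when x[bestvar] <= xbestsplit, otherwise right.
walk_tree <- function(x) {
  node <- 1
  while (nodestatus[node] != -1) {
    go_left <- x[bestvar[node]] <= xbestsplit[node]
    node <- if (go_left) treemap[node, 1] else treemap[node, 2]
  }
  node  # index of the terminal node that x falls into
}

walk_tree(c(1.0, 7))  # ends in node 2
walk_tree(c(4.0, 7))  # ends in node 3
```

In a real forest each of these would be one column (or one slice, for
"treemap") of the corresponding component, selected by tree number.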

One thing people should keep in mind about the "predicted" component of the
randomForest object (and the confusion matrix for the training data), as
well as "predict(rf.object)" without giving the newdata for prediction:
That prediction is based on out-of-bag samples, so it is *NOT* the same as
the usual prediction on training data.  It is closer to out-of-sample
prediction as in, e.g., cross-validation.
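
A small sketch of the difference, assuming the randomForest package is
installed and using the built-in iris data purely as an example (the seed and
the data set are arbitrary choices, not part of the original discussion):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)

oob_pred   <- predict(rf)                  # no newdata: out-of-bag predictions
train_pred <- predict(rf, newdata = iris)  # all trees vote on every training row

mean(oob_pred != iris$Species)    # OOB error rate
mean(train_pred != iris$Species)  # resubstitution error, usually optimistic
```

The first error rate is the one that matches the confusion matrix stored in
the object; the second typically looks much better than the model deserves.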

AFAIK there is only empirical and anecdotal evidence on the sensitivity of
performance to the value of mtry.  I can say that in my own experience,
fiddling with mtry will give at best a marginal improvement.  One easy way to
answer the question for your situation is to try it yourself and see.
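
One way to "try it yourself" might look like the sketch below, which refits
the forest for each candidate mtry and records the final out-of-bag error.
It assumes the randomForest package is installed; the iris data, the seed and
the grid 1:4 are illustrative choices only.

```r
library(randomForest)

set.seed(1)
mtry_values <- 1:4   # iris has 4 predictors, so these are all possible values

oob_error <- sapply(mtry_values, function(m) {
  rf <- randomForest(Species ~ ., data = iris, mtry = m)
  rf$err.rate[rf$ntree, "OOB"]   # OOB error after the last tree
})

data.frame(mtry = mtry_values, oob_error = oob_error)
```

Because each fit is random, small differences between rows are mostly noise;
repeating the loop with a few different seeds gives a better picture.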

With MDS on the proximity matrix, you probably need to be a bit careful in
its interpretation.  The proximity matrix of the training data is computed on
the *entire* training data, rather than just the out-of-bag portion.  Thus
the MDS plot will quite often show the classes as more "separable" than they
really are.  (We are thinking about a fix.  Breiman pointed out that the
difficulty is that if the proximity matrix is calculated only on the
out-of-bag data, then 1-proximity is no longer positive definite).
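
For reference, a minimal sketch of such an MDS plot, assuming the randomForest
package is installed and treating 1 - proximity as a dissimilarity passed to
base R's cmdscale.  The iris data and the seed are illustrative choices, and
the caveat above about overly "separable" classes applies to the result.

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)

# Classical MDS in two dimensions on 1 - proximity
mds <- cmdscale(1 - rf$proximity, k = 2)

plot(mds, col = as.integer(iris$Species),
     xlab = "Dim 1", ylab = "Dim 2",
     main = "MDS of 1 - proximity (training data)")
```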

HTH,
Andy

> -----Original Message-----
> From: chumpmonkey at hushmail.com [mailto:chumpmonkey at hushmail.com]
> Sent: Saturday, April 12, 2003 5:41 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] rpart vs. randomForest
> 
> 
> 
> Greetings. I'm trying to determine whether to use rpart or randomForest
> for a classification tree. Has anybody tested efficacy formally? I've
> run both and the confusion matrix for rf beats rpart. I've been looking at
> the rf help page and am unable to figure out how to extract the tree.
> But more than that I'm looking for a more comprehensive user's guide
> for randomForest including the benefits of using it with MDS. Can anybody
> suggest a general guide? I've been finding a lot of broken links and
> cs-type web pages rather than an end-user's guide. Also people's
> experience on adjusting the mtry param would be useful. Breiman says that
> it isn't too sensitive but I'm curious if anybody has had a different
> experience with it. Thanks in advance and apologies if this is too general.
> 

------------------------------------------------------------------------------