[R] RandomForest

Wed Aug 20 16:51:12 CEST 2003

Vladimir,

The OOB error rate estimates for the first few iterations are necessarily
variable, as the number of cases that have been OOB (and therefore
predicted) is relatively small.  As an example:

> library(randomForest)
> data(iris)
> err1 <- err2 <- numeric(10)
> set.seed(1)
> for(i in 1:10) err1[i] <- randomForest(Species~., data=iris, ntree=1)$err.rate
> for(i in 1:10) err2[i] <- randomForest(Species~., data=iris, ntree=5)$err.rate[5]
> var(err1)
[1] 0.001141612
> var(err2)
[1] 0.0001058114
> err1
 [1] 0.03703704 0.01818182 0.01886792 0.07142857 0.09259259 0.11764706 0.03773585
 [8] 0.03703704 0.08771930 0.05000000
> err2
 [1] 0.06666667 0.04316547 0.07462687 0.05839416 0.04511278 0.05303030 0.05109489
 [8] 0.04316547 0.05925926 0.05714286

When only one tree is grown, the OOB estimate of error rate is based on only
about 1/e (roughly 37%) of the cases, on average.  As the number of trees
increases, the number of cases that have been OOB at least once increases,
and therefore the number of cases used to estimate the error rate increases.
If my brain isn't too rusty, on average the proportion of cases that would
have been used to compute the OOB error rate, when five trees are grown, is
approximately

> 1 - dbinom(0, 5, 1/exp(1))
[1] 0.8990748
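
To make the arithmetic above a little more explicit, here is a minimal sketch
in base R.  The function names (p_oob_one_tree, prop_oob_at_least_once,
simulate_prop) are made up for illustration and are not part of the
randomForest package.  It assumes each tree draws its own bootstrap sample of
n cases independently, so a case is OOB for one tree with probability
(1 - 1/n)^n, which is close to 1/e.

```r
# Probability that a given case is out-of-bag (OOB) for a single tree when
# n cases are bootstrapped with replacement: (1 - 1/n)^n, about 1/e.
p_oob_one_tree <- function(n) (1 - 1/n)^n

# Expected proportion of cases that have been OOB at least once after
# ntree trees.  With n = Inf the limiting value 1/e is used.
prop_oob_at_least_once <- function(ntree, n = Inf) {
  p <- if (is.finite(n)) p_oob_one_tree(n) else exp(-1)
  1 - (1 - p)^ntree
}

# Rough check by simulation: draw bootstrap samples and record which cases
# were left out of at least one of them.
simulate_prop <- function(ntree, n, reps = 200) {
  mean(replicate(reps, {
    oob_once <- rep(FALSE, n)
    for (t in seq_len(ntree)) {
      inbag <- sample.int(n, n, replace = TRUE)
      oob_once <- oob_once | !(seq_len(n) %in% inbag)
    }
    mean(oob_once)
  }))
}

prop_oob_at_least_once(1)       # about 0.37
prop_oob_at_least_once(5)       # same value as the dbinom() call above
prop_oob_at_least_once(5, 150)  # finite-n version for n = 150 (iris)
prop_oob_at_least_once(50)      # essentially 1

set.seed(1)
simulate_prop(5, 150)           # should land close to the line for n = 150
```

With 50 trees the proportion is essentially 1, which is consistent with the
point that the problem only shows up for very small forests.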

I guess Leo never intended people to grow fewer than, say, 50 trees, so
that's probably why he didn't care about this problem.

Regarding your other questions:

1. Right, except that they don't need to add up to one.  The code normalizes
that internally.
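
As a small illustration of what "normalizes" means here (only the arithmetic,
not the package's internal code): weights given on any positive scale can be
rescaled to sum to one.  The values below are hypothetical.

```r
# Hypothetical weights for three classes, given on an arbitrary scale.
w_raw <- c(3, 6, 1)

# Rescale so that the weights sum to one.
w_norm <- w_raw / sum(w_raw)

w_norm
all.equal(w_norm, c(0.3, 0.6, 0.1))
```

So, assuming the rescaling works as described, c(3, 6, 1) and
c(0.3, 0.6, 0.1) would describe the same relative weighting.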

2. Someone asked me the exact same question last night (off the list).
Here's what I replied:

Leo did not have any documentation on what those arrays are, but we sort of
figured it out ourselves.

For each tree:
ndbigtree is the number of nodes in the tree

For each node in the tree:
nodestatus is an indicator (-1=terminal)
bestvar is the variable used to split the node (0 if node is terminal)
treemap contains "pointers" to descendant nodes (e.g., [2, 3] means the left
descendant is node 2 and the right descendant is node 3; both are 0 if the
node is terminal)
nodeclass is the class of the node, if terminal (0 otherwise)
xbestsplit is the cutoff used to split the node
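
The per-node arrays above can be read as a small lookup table.  Here is a
minimal sketch in base R that walks one tree to classify a single case.  The
tree below is hand-built with made-up values, predict_one is a hypothetical
helper (not a randomForest function), and two things are assumptions rather
than facts from the description above: one row of treemap per node, and that
cases with a value <= xbestsplit go to the left descendant.  Continuous
predictors only.

```r
# Hypothetical three-node tree: node 1 splits, nodes 2 and 3 are terminal.
tree <- list(
  ndbigtree  = 3L,
  nodestatus = c(1L, -1L, -1L),   # -1 = terminal; other values = not terminal
  bestvar    = c(2L, 0L, 0L),     # node 1 splits on variable 2
  treemap    = rbind(c(2L, 3L),   # node 1: left = node 2, right = node 3
                     c(0L, 0L),   # terminal nodes have no descendants
                     c(0L, 0L)),
  nodeclass  = c(0L, 1L, 2L),     # class label at terminal nodes
  xbestsplit = c(2.5, 0, 0)       # cutoff used at node 1
)

# Follow the "pointers" in treemap from the root until a terminal node.
predict_one <- function(tree, x) {
  k <- 1L
  while (tree$nodestatus[k] != -1L) {
    # Assumed convention: value <= cutoff goes left, otherwise right.
    go_left <- x[tree$bestvar[k]] <= tree$xbestsplit[k]
    k <- tree$treemap[k, if (go_left) 1L else 2L]
  }
  tree$nodeclass[k]
}

predict_one(tree, c(5.0, 1.4))  # variable 2 is below the cutoff: class 1
predict_one(tree, c(5.0, 4.7))  # variable 2 is above the cutoff: class 2
```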

For the forest:
pid is the vector of normalized class weights
ncat is the vector of number of categories in the predictors (=1 for
continuous variables)
maxcat is max(ncat)
nrnodes is the maximum possible number of nodes in a tree
ntree is the number of trees in the forest
nclass is the number of classes in the response

The meaning of xbestsplit for a categorical predictor is a bit tricky: a
binary expansion of the possible splits is done to simplify the splitting.
You might find the heuristics in the CART book, but I'm not sure.
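
A hedged sketch of how such a binary expansion could be decoded, in base R.
This follows the convention described in the getTree documentation of later
randomForest versions (the split value is an integer whose binary digits flag
categories, lowest bit first, with flagged categories sent to the left).
Whether xbestsplit in this version uses exactly that encoding is an
assumption, and decode_cat_split is a hypothetical helper name.

```r
# Decode an integer split value into the categories flagged by its binary
# expansion, lowest bit = category 1.
decode_cat_split <- function(split, ncat) {
  bits <- (split %/% 2^(0:(ncat - 1))) %% 2
  which(bits == 1)
}

# 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3, so categories 1, 3 and 4 are flagged.
decode_cat_split(13, 4)
```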

HTH,
Andy

> -----Original Message-----
> From: Vladimir N. Kutinsky [mailto:kutinskyv at obninsk.com] 
> Sent: Wednesday, August 20, 2003 10:26 AM
> To: Liaw, Andy; r-help at stat.math.ethz.ch
> Subject: RE: [R] RandomForest
> 
> 
> Andy,
> 
> Does it mean that the error rate does increase as long as the 
> aggregating number of out-of-bag cases reaches the number of 
> all cases?  or, in other words, because the number of points 
> being predicted (right or wrong) gets larger at the first 
> steps of the process?
> 
> If it so then it's all clear now.
> 
> A few more questions.
> 1. What is the format of using the "classwt" parameter? 
> Should it be something like c(0.3,0.6,0.1)?
> 
> 2. Where can I find any information about rf$forest element 
> of the result? namely about its elements like $treemap, 
> $nodeclass etc?
> 
> Thanks,
> Vladimir
> 
> 
