[R] AW: random forests for R

Bernd Huwe Hildegard.Goelz-Huwe at t-online.de
Fri Apr 5 10:56:03 CEST 2002

Great. I like the algorithm a lot, and it does not seem all that difficult
to implement either. It is somewhat reminiscent of the genetic

Bernd Huwe

-----Original Message-----
From: owner-r-announce at stat.math.ethz.ch
[mailto:owner-r-announce at stat.math.ethz.ch] On Behalf Of Liaw, Andy
Sent: Tuesday, April 2, 2002 17:23
To: 'r-announce at lists.R-project.org'
Subject: random forests for R

Hi all,

There is now a package available on CRAN that provides an R interface to Leo
Breiman's random forest classifier.

Basically, random forest does the following:

1.  Select ntree, the number of trees to grow, and mtry, a number no larger
than the number of variables.
2.  For i = 1 to ntree, repeat steps 3 to 5:
3.  Draw a bootstrap sample from the data.  Call those not in the bootstrap
sample the "out-of-bag" data.
4.  Grow a "random" tree, where at each node, the best split is chosen among
mtry randomly selected variables.  The tree is grown to maximum size and not
pruned back.
5.  Use the tree to predict out-of-bag data.
6.  In the end, use the predictions on out-of-bag data to form majority
votes.
7.  Prediction of test data is done by majority votes from predictions from
the ensemble of trees.
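
For readers who want to see the steps above in code, here is a minimal
sketch in plain Python.  It is not the package's code or Breiman's Fortran;
the function names, the Gini split criterion, the midpoint thresholds, and
the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, mtry, rng):
    """Step 4: grow an unpruned tree, trying only mtry random variables per node."""
    if len(set(y)) == 1:
        return y[0]                                   # pure node becomes a leaf
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        vals = sorted(set(row[j] for row in X))
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2                         # midpoint threshold (assumption)
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if not right:                             # guard against rounding of t
                continue
            score = (len(left) * gini([y[i] for i in left]) +
                     len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:                                  # no usable split: majority leaf
        return Counter(y).most_common(1)[0][0]
    _, j, t, left, right = best
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, rng))

def predict_tree(tree, row):
    """Follow the splits down to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def random_forest(X, y, ntree, mtry, seed=0):
    rng = random.Random(seed)
    n = len(X)
    trees, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(ntree):                            # step 2
        boot = [rng.randrange(n) for _ in range(n)]   # step 3: bootstrap sample
        in_bag = set(boot)
        tree = grow_tree([X[i] for i in boot], [y[i] for i in boot], mtry, rng)
        trees.append(tree)
        for i in range(n):                            # step 5: predict out-of-bag cases
            if i not in in_bag:
                oob_votes[i][predict_tree(tree, X[i])] += 1
    # Step 6: majority vote over out-of-bag predictions gives an error estimate.
    scored = [(v.most_common(1)[0][0], y[i]) for i, v in enumerate(oob_votes) if v]
    oob_error = sum(p != t for p, t in scored) / len(scored) if scored else float("nan")
    return trees, oob_error

def predict_forest(trees, row):
    """Step 7: majority vote of all trees for a new case."""
    return Counter(predict_tree(t, row) for t in trees).most_common(1)[0][0]

# Toy data (made up): two well-separated clusters in two variables.
data_rng = random.Random(1)
X = ([[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)] +
     [[data_rng.gauss(10, 1), data_rng.gauss(10, 1)] for _ in range(30)])
y = ["a"] * 30 + ["b"] * 30
trees, oob_error = random_forest(X, y, ntree=25, mtry=1)
print(oob_error, predict_forest(trees, [0, 0]), predict_forest(trees, [10, 10]))
```

The sketch favours readability over speed: every node rescans its data for
each candidate split, which is fine for a toy example but not for real use.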

In the tech report
http://oz.berkeley.edu/users/breiman/randomforest2001.pdf, Breiman showed
that this technique is very competitive with boosting classification trees.
In our own experience, it is competitive with nonlinear classifiers such as
artificial neural nets and support vector machines.  Two of the significant
advantages of random forests over other methods (IMHO) are: a) there is only
one parameter (mtry) to adjust, and the result is usually not sensitive to it;
and b) the built-in cross-validation via the use of out-of-bag data gives a
quite accurate estimate of test set error, and offers quite effective
protection against overfitting.
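
One way to see why the out-of-bag data can serve as a built-in test set: a
bootstrap sample of size n leaves out each case with probability
(1 - 1/n)^n, which is close to exp(-1), or roughly 37%, for large n.  So
each tree has a sizeable set of cases it never saw.  The short Python
simulation below checks this; the function name and the values of n and
the number of repetitions are illustrative assumptions.

```python
import math
import random

def oob_fraction(n, rng):
    """Fraction of n cases left out of one bootstrap sample of size n."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n

rng = random.Random(0)
n, reps = 1000, 200
simulated = sum(oob_fraction(n, rng) for _ in range(reps)) / reps
expected = (1 - 1 / n) ** n            # tends to exp(-1) as n grows
print(simulated, expected, math.exp(-1))
```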

The code is based on version 3.1 of the original Fortran code written by
Breiman and Cutler (http://www.stat.berkeley.edu/users/breiman/).  The User
Guide for the Fortran code on Breiman's web site explains some of the
facilities provided in the code (such as assessing variable importance, and
proximity measures).  Some facilities provided in the original Fortran code
have been taken out:  transforming data to principal components, and
multidimensional scaling of the "proximity" matrix.  These can easily be
done in R before and after calls to the random forest functions.  Random
numbers are generated by R's RNG, rather than the one supplied in the
original Fortran code.
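
As a rough illustration of what gets passed on to multidimensional scaling:
in Breiman's description, the proximity between two cases is the fraction
of trees in which they land in the same terminal node, and one minus the
proximity can be used as a dissimilarity.  The Python sketch below assumes
a hypothetical table leaf_ids[t][i] holding the terminal node of case i in
tree t; neither the name nor the layout comes from the package.

```python
def proximity_matrix(leaf_ids):
    """Fraction of trees in which each pair of cases shares a terminal node."""
    ntree, n = len(leaf_ids), len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / ntree for c in row] for row in counts]

# Made-up terminal-node assignments: 4 trees, 3 cases.
leaf_ids = [[0, 0, 1],
            [2, 2, 2],
            [1, 0, 0],
            [0, 0, 1]]
prox = proximity_matrix(leaf_ids)
# Dissimilarities that a scaling routine could take as input.
dissim = [[1 - p for p in row] for row in prox]
print(prox[0][1], prox[0][2], prox[1][2])   # 0.75 0.25 0.5
```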

I'd like to thank Profs. B. D. Ripley, J. Lindsey, and others on R-help who
answered many of my questions when I was working on this package.  The
formula interface and part of the code in the predict method are outright
"stolen" from svm() in the e1071 package and nnet() in the VR bundle.

Questions/comments/bugs/patches welcome!

Andy I. Liaw, PhD
Biometrics Research          Phone: (732) 594-0820
Merck & Co., Inc.              Fax: (732) 594-1565
P.O. Box 2000, RY70-38            Rahway, NJ 07065
mailto:andy_liaw at merck.com


