[R-pkgs] new version of randomForest (4.0-7)

Mon Jan 12 04:22:32 CET 2004

Dear R users,

I've just released a new version of randomForest (available on CRAN now).
This version contains quite a number of new features and bug fixes
compared to versions prior to 4.0-x (and a few more since 4.0-1).

For those not familiar with randomForest, it's an ensemble
classifier/regression tool.  Please see
http://www.math.usu.edu/~adele/forests/ for more detailed information, as
well as the Fortran code.

Comments/questions/bug reports/patches are much appreciated!

A few notes about the new version:

o  There is a new tuneRF() function for searching for the optimal mtry,
following Breiman's suggestion.  PLEASE use it to see if the result can be
improved!

o  A new variable importance measure replaces the one based on margin.  This
new measure is the same as in Breiman's V5.  The analogous measure is also
implemented for regression.  This new measure is designed to be more robust
against data where predictor variables have very different numbers of
possible splits (i.e., unique values/categories).  The previous measure
tended to make variables with more possible splits look more important.

o  For classification, the new measure is also computed on a per-class
basis.

o  There is a new `sampsize' option for down-sampling larger classes.
E.g., if in a two-class problem there are 950 class 1s and 50 class 2s,
using sampsize=c(50, 50) will usually give a more `balanced' classifier.

o  There is a new importance() function for extracting the importance
measure.

o  The predict() method has an option to return predictions by the component
trees.

o  There is a new getTree() function for looking at one of the trees in the
forest. 

o  For dealing with missing values in the predictor variables, there are
na.roughfix() and rfImpute(), which correspond to the `missquick' and
`missright' options in Breiman's V4/V5 code.  Both work for classification
as well as regression.

o  There is an experimental bias reduction step in regression (the corr.bias
argument in randomForest) that could be very effective for some data (but
has essentially no effect for others).
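
A minimal sketch of how several of the items above might be used together.
It assumes the randomForest package is installed and uses R's built-in iris
data purely for illustration.  The argument names predict.all and labelVar,
and the imbalanced subset built below, are assumptions taken from later
package releases and from this example, not from the 4.0-7 notes themselves.

```r
library(randomForest)

set.seed(1)
x <- iris[, 1:4]
y <- iris$Species

# Search over mtry using the out-of-bag error (the tuneRF() item above).
tuned <- tuneRF(x, y, ntreeTry = 100, stepFactor = 2, improve = 0.05,
                trace = FALSE, plot = FALSE)

# importance = TRUE asks for the permutation-based measure in addition
# to the Gini-based one; importance() extracts the result as a matrix.
rf  <- randomForest(x, y, ntree = 200, importance = TRUE)
imp <- importance(rf)

# Predictions by the component trees (predict.all is assumed here).
pred_all <- predict(rf, x, predict.all = TRUE)

# Look at the first tree in the forest.
tree1 <- getTree(rf, k = 1, labelVar = TRUE)

# Down-sampling: build an artificial 50-vs-10 two-class problem, then
# draw 10 cases per class for each tree.
keep <- c(which(iris$Species == "versicolor"),
          which(iris$Species == "virginica")[1:10])
x2 <- iris[keep, 1:4]
y2 <- droplevels(iris$Species[keep])
rf_bal <- randomForest(x2, y2, ntree = 200, sampsize = c(10, 10))

# Missing predictor values: quick median/mode fill versus
# proximity-based imputation.
x_na <- x
x_na[c(3, 60, 120), "Sepal.Length"] <- NA
x_fix <- na.roughfix(x_na)
x_imp <- rfImpute(x_na, y, iter = 3, ntree = 100)
```

Here rfImpute() returns the response as the first column followed by the
imputed predictors, so x_imp has one more column than x_na.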

Some notes about differences between the package and Breiman's Fortran code:

o  Breiman uses the class weights to cast weighted votes.  This is not done
in the R version.  However, one can use the threshold argument to
randomForest to get similar (but not exactly the same) effect.

o  In Breiman's V4/V5 code, the Gini-based importance is weighted by the
out-of-bag data.  This has not been implemented in the R version.

o  Breiman's V4/V5 code can handle categorical predictors with more than 32
categories.  This has not been implemented in the R version.

o  Breiman's classification code uses mtry differently than the R version:
the mtry variables are sampled *with replacement* at each node.  The R
version samples without replacement, so that if mtry is set to the number
of predictors, one gets the same result as bagging.  Breiman's regression
code *does* sample the variables without replacement.

o  In the R version, ties are randomly broken when finding best variables,
or when making predictions.  In Breiman's code, the first one found wins.

o  The `prototypes' Breiman described have not been implemented.  There are
situations when they can be misleading, so I have chosen not to implement
them.

o  The `interaction detection' feature in Breiman's V5 has not been
implemented (but is fairly high on my to-do list).
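
Two of the differences above can be sketched in a few lines.  This is an
illustration only, again on the built-in iris data.  It assumes the
thresholding argument is spelled `cutoff', as in later releases of the
package; the notes above refer to it simply as the threshold argument, so
the name may differ in 4.0-7.  The cutoff values are arbitrary.

```r
library(randomForest)

set.seed(2)
x <- iris[, 1:4]
y <- iris$Species

# With mtry equal to the number of predictors, every variable is a
# candidate at every node, so the forest reduces to bagged trees.
bag <- randomForest(x, y, mtry = ncol(x), ntree = 200)

# In place of weighted votes: a class wins when its vote fraction divided
# by its cutoff is largest, so a smaller cutoff favours that class.
# ('cutoff' is the assumed argument name; the values are illustrative.)
rf_cut <- randomForest(x, y, ntree = 200, cutoff = c(0.25, 0.25, 0.5))

# Compare the out-of-bag confusion matrices of the two fits.
print(bag$confusion)
print(rf_cut$confusion)
```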

Best,
Andy

Andy Liaw, PhD
Biometrics Research      PO Box 2000, RY33-300     
Merck Research Labs           Rahway, NJ 07065
mailto:andy_liaw at merck.com        732-594-0820
