[R] na.action in randomForest --- Summary

Liaw, Andy andy_liaw at merck.com
Wed Aug 6 06:33:17 CEST 2003


To make it clear:  Version 3.3 and older of Breiman's code do not handle NAs
at all:  you need to exclude them before running random forest.  One can
easily do so by using na.action=na.omit (which, despite the documented
default of na.fail, seemed to be what was actually used).  The airquality
data used in the examples in the randomForest help page has many NAs, for
example.
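
For illustration, here is a minimal sketch of the na.omit route, assuming
the randomForest package is installed.  The choice of Ozone as the response
and the seed are my own assumptions for the example, not something taken
from the help page:

```r
library(randomForest)

data(airquality)

# Rows of airquality that contain at least one NA
n_incomplete <- sum(!complete.cases(airquality))

set.seed(1)  # arbitrary seed, only so the run is repeatable

# na.action = na.omit drops the incomplete rows before the forest is grown
fit <- randomForest(Ozone ~ ., data = airquality, na.action = na.omit)

# Number of observations the forest was actually fitted on
n_used <- length(fit$y)
```

Comparing n_used with nrow(airquality) shows how many rows were dropped
because of missing values.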

The na.roughfix implements the idea that Breiman added to V4.0 of his code
(only for classification).  For "fancier" imputation by nearest neighbors,
measured by proximities from random forest, there's the "rfImpute" function
(linked from the help page of na.roughfix).  The advantage is that it works
for both regression and classification.
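
A rough sketch of the two imputation routes, again on airquality and again
assuming the randomForest package is installed.  The response (Ozone), the
seed, and the decision to drop rows with a missing response before calling
rfImpute are assumptions made for this example; rfImpute does not accept
NAs in the response:

```r
library(randomForest)

data(airquality)

# na.roughfix: NAs in numeric columns are replaced by the column median
# (factor columns would get their most frequent level)
aq_rough <- na.roughfix(airquality)

# rfImpute needs a complete response, so keep only rows where Ozone is known;
# the remaining NAs are in the predictors
aq <- airquality[!is.na(airquality$Ozone), ]

set.seed(1)  # arbitrary seed, only so the run is repeatable

# Proximity-based imputation of the predictors; the response comes back
# as the first column of the result
aq_imputed <- rfImpute(Ozone ~ ., data = aq)
```

The imputed data frame can then be passed to randomForest as usual, e.g.
randomForest(Ozone ~ ., data = aq_imputed).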

I have not yet implemented all the new features that Leo introduced in V4,
so the version of the randomForest package is currently at 3.9-x.  Adding
the new features is more involved than one may think, as I'm adding the
features to the existing code, rather than modifying Leo's new code and
putting it in the package.  The reason:  I've fixed a few bugs and added a
few features in the package, and I don't want to lose those.

Just so you know, I believe Leo is making some changes to the way imputation
is done in V5...

HTH,
Andy

> -----Original Message-----
> From: David Parkhurst [mailto:parkhurs at ariel.ucs.indiana.edu] 
> Sent: Tuesday, August 05, 2003 3:31 PM
> To: r-help at stat.math.ethz.ch
> Subject: Re: [R] na.action in randomForest --- Summary
> 
> 
> A few days ago I asked whether there were options other than 
> na.action=na.fail for the R port of Breiman's randomForest;  
> the function's help page did not say anything about other options.
> 
> I have since discovered that a pdf document called "The 
> randomForest  Package" and made available by Andy Liaw (who 
> made the tool available in R---thank you) does discuss an 
> option.  It is an implementation of Breiman's suggestion "to 
> replace each missing value by the median of its column and 
> each missing categorical by the most frequent value in that 
> categorical. My impression is that because of the randomness 
> and the many trees grown, filling in missing values with a 
> sensible values does not effect accuracy much." (from his 
> report, "Manual On Setting Up, Using, And Understanding 
> Random Forests V3.1").
> 
> I now plan to try the na.roughfix option from Liaw's package.
> 
> Thanks to Uwe Ligges and Brian Ripley for their replies to my posting.
> 
> Dave Parkhurst
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list 
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> 




