[R] randomForest and missing data

Darin A. England england at cs.umn.edu
Thu Jan 4 23:07:10 CET 2007


Yes, I completely agree with your statements. As for a way around
it, CART has some facilities for dealing with missing data. For
example, when an observation is dropped down the tree and encounters
a split on a variable whose value is missing, one option is simply
not to send it any further. One then obtains a prediction from that
interior node. It is probably not a very good prediction, but it is
one way to handle cases with missing values. So my question is: why
can't we have that capability in randomForest as well?
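
To make the idea concrete, here is a minimal sketch in base R. It is
only an illustration, not how randomForest or rpart are implemented:
the nested-list tree, the names leaf, node, toy_tree and
predict_stop_at_na, and all the numbers are made up for the example.
It assumes every node, interior or leaf, stores the mean response of
the training cases that reached it.

```r
# Toy regression tree as nested lists (hypothetical structure, not rpart's).
# Every node stores `pred`, so interior nodes can also give a prediction.
leaf <- function(pred) list(pred = pred)
node <- function(var, cut, pred, left, right) {
  list(var = var, cut = cut, pred = pred, left = left, right = right)
}

toy_tree <- node("x1", 5, pred = 10,
  left  = leaf(4),
  right = node("x2", 2, pred = 14,
    left  = leaf(12),
    right = leaf(20)))

# Drop one observation (a named numeric vector) down the tree.
# If the split variable is NA, stop and return the current node's prediction.
predict_stop_at_na <- function(tree, obs) {
  current <- tree
  while (!is.null(current$var)) {      # leaves have no split variable
    value <- obs[[current$var]]
    if (is.na(value)) break            # missing: do not send further down
    current <- if (value < current$cut) current$left else current$right
  }
  current$pred
}

predict_stop_at_na(toy_tree, c(x1 = 7, x2 = 1))   # reaches a leaf
predict_stop_at_na(toy_tree, c(x1 = 7, x2 = NA))  # stops at the x2 split
predict_stop_at_na(toy_tree, c(x1 = NA, x2 = 1))  # stops at the root
```

A forest built this way would average (or take a vote over) such
per-tree predictions, some of which would come from interior nodes.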

Darin

On Thu, Jan 04, 2007 at 03:44:27PM -0600, Sicotte, Hugues Ph.D. wrote:
> I don't know about this module, but a general answer is that if you have
> missing data, it may affect your model. If your data is missing at
> random, then you might be lucky in your model building.
> 
> If, however, your data are not missing at random (e.g. censoring), you
> might build a wrong predictor.
> 
> Missing at random or not, that is a question you should answer and deal
> with before modeling.
> 
> I refer you to a book like
> "Analysis of Incomplete Multivariate Data" by Schafer.
> 
> If there is a way around that with randomForest, I'd be interested to
> know too.
> 
> Hugues Sicotte
> 
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Darin A. England
> Sent: Thursday, January 04, 2007 3:13 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] randomForest and missing data
> 
> 
> Does anyone know a reason why, in principle, a call to randomForest
> cannot accept a data frame with missing predictor values? If each
> individual tree is built using CART, then it seems like this
> should be possible. (I understand that one may impute missing values
> using rfImpute or some other method, but I would like to avoid doing
> that.) 
> 
> If this functionality were available, then when the trees are being
> constructed and when subsequent data are put through the forest, one
> would also specify an argument for the use of surrogate rules, just
> like in rpart. 
> 
> I realize this question is very specific to randomForest, as opposed
> to R in general, but any comments are appreciated. I suppose I am
> looking for someone to say "It's not appropriate, and here's why
> ..." or "Good idea. Please implement and post your code."
> 
> Thanks,
> 
> Darin England, Senior Scientist
> Ingenix
> 


