[R] randomForest and missing data

Sicotte, Hugues Ph.D. Sicotte.Hugues at mayo.edu
Thu Jan 4 22:44:27 CET 2007

I don't know this package, but as a general answer: missing data can
affect your model. If your data are missing at random, the model you
build may still be sound.

If, however, your data are not missing at random (e.g. because of
censoring), you might build a wrong predictor.

Missing at random or not, that is a question you should answer and deal
with before modeling.
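
As a rough illustration of the kind of check meant here, the sketch below
uses the built-in airquality data set (an illustrative choice, not data
from this thread) and regresses an indicator of missingness on the fully
observed columns. It can only reveal whether missingness is related to
variables you did observe; it cannot tell you whether missingness depends
on the unobserved values themselves.

```r
# Illustrative data only: airquality ships with base R.
data(airquality)

# How many values are missing in each column?
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Indicator: 1 where Ozone is missing, 0 where it was observed.
ozone_missing <- as.integer(is.na(airquality$Ozone))

# Logistic regression of the indicator on fully observed columns.
# Clearly non-zero coefficients suggest the missingness is related
# to those columns, i.e. it is not completely at random.
fit <- glm(ozone_missing ~ Temp + Wind + factor(Month),
           data = airquality, family = binomial)
print(summary(fit)$coefficients)
```

A check like this is a starting point only; subject-matter knowledge about
how the data were collected is usually needed as well.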

I refer you to a book like "Analysis of Incomplete Multivariate Data"
by Schafer.

If there is a way around that with randomForest, I'd be interested to
know too.

Hugues Sicotte

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Darin A. England
Sent: Thursday, January 04, 2007 3:13 PM
To: r-help at stat.math.ethz.ch
Subject: [R] randomForest and missing data

Does anyone know a reason why, in principle, a call to randomForest
cannot accept a data frame with missing predictor values? If each
individual tree is built using CART, then it seems like this
should be possible. (I understand that one may impute missing values
using rfImpute or some other method, but I would like to avoid doing
that.)
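
For reference, the imputation route mentioned in the parenthesis might
look like the sketch below. It assumes the randomForest package is
installed and uses iris with artificially removed values; the data, seeds
and object names are illustrative choices only.

```r
library(randomForest)

# Illustrative data: iris with some predictor values knocked out.
iris_na <- iris
set.seed(111)
for (i in 1:4) {
  iris_na[sample(150, 15), i] <- NA
}

# rfImpute fills in missing predictor values using forest proximities.
# The response (Species) must be complete; it is returned as column 1.
set.seed(222)
iris_imputed <- rfImpute(Species ~ ., data = iris_na)

# The completed data frame can then be passed to randomForest as usual.
set.seed(333)
rf <- randomForest(Species ~ ., data = iris_imputed)
print(rf)
```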

If this functionality were available, then when the trees are being
constructed and when subsequent data are put through the forest, one
would also specify an argument for the use of surrogate rules, just
like in rpart. 
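
To make the comparison concrete, here is a minimal sketch of how rpart
uses surrogate splits for a single tree. It shows rpart's behaviour only,
not anything randomForest provides; the iris data and the artificial
missingness are assumptions made for illustration.

```r
library(rpart)

# Illustrative data: iris with 30 Petal.Length values set to NA.
set.seed(1)
train <- iris
train$Petal.Length[sample(nrow(train), 30)] <- NA

# rpart keeps rows with missing predictors and records surrogate splits.
# usesurrogate = 2: if every surrogate is also missing, send the
# observation in the majority direction.
fit <- rpart(Species ~ ., data = train,
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))

# New observations with a missing predictor still receive a prediction.
new_obs <- iris[c(1, 51, 101), ]
new_obs$Petal.Length <- NA
pred <- predict(fit, newdata = new_obs, type = "class")
print(pred)
```

Calling summary(fit) lists the surrogate splits chosen at each node.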

I realize this question is very specific to randomForest, as opposed
to R in general, but any comments are appreciated. I suppose I am
looking for someone to say "It's not appropriate, and here's why
..." or "Good idea. Please implement and post your code."


Darin England, Senior Scientist

