[R] varimp in party (or randomForest)

Jason Jones Medical Informatics Jason.Jones3 at imail.org
Fri Sep 26 00:56:56 CEST 2008


There is an excellent article at http://www.biomedcentral.com/1471-2105/9/307 by Strobl et al. describing variable importance in random forests.  Does anyone have any suggestions (besides imputation or removal of cases) for how to deal with data that *have* missing values in the predictor variables?

Below is an excerpt of some code referenced in the article.  I have commented out one line and added one line in its place.  The code runs beautifully if only complete cases are included, but when missing data are present it breaks at the variable importance step (though it still builds the tree).

# From http://www.biomedcentral.com/content/supplementary/1471-2105-8-25-S1.R

library(party)  # provides cforest(), cforest_control() and varimp()

arabidopsis_url <- "http://www.biomedcentral.com/content/supplementary/1471-2105-5-132-S1.txt"

arabidopsis <- read.table(arabidopsis_url, header = TRUE,
                          sep = " ", na.strings = "X")

#arabidopsis <- subset(arabidopsis, complete.cases(arabidopsis))
# Keep rows with an observed response; predictors may still contain NAs
arabidopsis <- subset(arabidopsis, !is.na(edit))

arabidopsis <- arabidopsis[, !(names(arabidopsis) %in% c("X0", "loc"))]

my_cforest_control <- cforest_control(teststat = "quad",
    testtype = "Univ", mincriterion = 0, ntree = 50, mtry = 3,
    replace = TRUE)

my_cforest <- cforest(edit ~ ., data = arabidopsis,
                      controls = my_cforest_control)
# This is the step that breaks when predictors contain NAs
varimp_cforest <- varimp(my_cforest)

By the way, the same issue arises for the randomForest package.
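
For readers who want to reproduce the randomForest side without the arabidopsis file, here is a minimal sketch. It assumes the built-in airquality data as a stand-in (it is not the data from the article) and relies on the formula interface of randomForest defaulting to na.action = na.fail, as it does in the versions I am aware of. The names aq and fit_err are made up for illustration.

```r
library(randomForest)

# airquality ships with R; Ozone and Solar.R contain NAs.
# Stand-in data only, not the arabidopsis data from the article.
data(airquality)

# Keep rows with an observed response, mirroring the filter on `edit` above.
aq <- airquality[!is.na(airquality$Ozone), ]

# Some predictor values are still missing at this point.
colSums(is.na(aq))

# With the default na.action (na.fail), fitting stops on NA predictors,
# so no importance measure can be computed. tryCatch() captures the error.
fit_err <- tryCatch(
  randomForest(Ozone ~ ., data = aq, ntree = 50, importance = TRUE),
  error = function(e) e
)

inherits(fit_err, "error")
conditionMessage(fit_err)
```

Passing na.action = na.omit or na.action = na.roughfix makes the fit go through, but those amount to removal of cases and imputation respectively, which are the two options the question above sets aside.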

Does anyone have any suggestions?  I'm more interested in the variable importance than the tree per se.



Jason Jones, PhD
Medical Informatics
j.jones at imail.org
