[R] randomForest and factor predictors--unexpected results

Richard L. Valliant rvallian at umd.edu
Thu Jan 28 00:38:14 CET 2016


I've been experimenting with the randomForest R package (v. 4.6-12) and am getting an unexpected difference between rpart and randomForest results that may have something to do with using x's that are factors.

The same model (see code below) is used to predict a binary variable called "resp" that is treated as a factor.  Four of the x's are factors.

The rpart predicted probabilities, computed on the full dataset, average to exactly mean(resp) (0.690 for resp = 1).  This seems OK.
The randomForest predicted probabilities average to 0.751 for resp = 1, which is quite a bit different from mean(resp).  This seems unexpected since random forests amount to repeatedly doing variations of what rpart does.
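
As a rough illustration of the arithmetic behind the rpart result: when each leaf predicts its own within-leaf proportion of 1s, the average of the predictions over the fitting data reproduces the overall proportion. The base-R sketch below uses simulated data, so the leaf assignment `leaf`, the per-leaf rates, and `y` are all made up for illustration and have nothing to do with nhis. It also shows that hard 0/1 majority votes per leaf need not average to the overall proportion. Whether that is what drives the randomForest difference here is only an assumption, not something this sketch establishes.

```r
set.seed(42)                               # arbitrary seed for the toy data
n    <- 1000
leaf <- sample(1:4, n, replace = TRUE)     # hypothetical leaf membership
p    <- c(0.9, 0.75, 0.6, 0.45)[leaf]      # made-up response rate per leaf
y    <- rbinom(n, 1, p)                    # simulated 0/1 response

# Tree-style probability: proportion of 1s in each observation's own leaf
leaf.prop <- ave(y, leaf)                  # ave() defaults to the group mean
mean(leaf.prop) - mean(y)                  # zero up to rounding error

# Vote-style prediction: each leaf casts a hard 0/1 vote for its majority class
leaf.vote <- as.numeric(leaf.prop > 0.5)
mean(leaf.vote)                            # need not equal mean(y)
```

The first difference is zero because summing the leaf proportion over the members of a leaf gives back the number of 1s in that leaf, so the grand total is just sum(y).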

Has anyone seen anything like this, or can anyone see what I am doing wrong?

(I did the same comparison using the kyphosis dataset in rpart with all continuous predictors and found consistent average predicted probabilities between rpart and randomForest.)
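
The kyphosis comparison mentioned above could look roughly like the sketch below. The exact call is not given in this post, so the formula, the default rpart settings, ntree = 1000, and the seed are assumptions chosen for illustration rather than the settings actually used.

```r
library(rpart)          # also supplies the kyphosis data
library(randomForest)

set.seed(1)             # arbitrary seed, only so the forest is repeatable

data(kyphosis, package = "rpart")

# Observed class proportions as a named numeric vector
obs <- sapply(levels(kyphosis$Kyphosis),
              function(l) mean(kyphosis$Kyphosis == l))

# Single classification tree, predictions on the fitting data
fit.rp  <- rpart(Kyphosis ~ Age + Number + Start,
                 data = kyphosis, method = "class")
rp.mean <- colMeans(predict(fit.rp, newdata = kyphosis, type = "prob"))

# Random forest on the same (all-numeric) predictors, same data for prediction
fit.rf  <- randomForest(Kyphosis ~ Age + Number + Start,
                        data = kyphosis, ntree = 1000)
rf.mean <- colMeans(predict(fit.rf, newdata = kyphosis, type = "prob"))

# Side-by-side averages for comparison with the observed proportions
round(rbind(observed = obs, rpart = rp.mean, randomForest = rf.mean), 3)
```

The last line puts the three sets of averages in one table, which makes it easy to see how far each method's mean prediction sits from the observed class proportions.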

Here's the code ... 

library(PracTools)     # R package with dataset used
library(rpart)
library(randomForest)

data(nhis)  # dataset in PracTools
table(nhis$resp)/nrow(nhis)
#        0         1
#0.3098952 0.6901048

t1 <- rpart(resp ~ age + as.factor(hisp) + as.factor(race) + as.factor(parents_r) + as.factor(educ_r),
      method = "class",
      control = rpart.control(minbucket = 50, cp=0),
      data = nhis)
rpart.prob <- predict(object = t1, newdata = nhis, type = "prob")
apply(rpart.prob,2,mean)
#        0         1
#0.3098952 0.6901048    mean of rpart predictions same as mean(resp)

# cycled through mtry = 1,...,5; the lower mtry is, the worse the predicted probs are
rf.nhis <- randomForest(as.factor(resp) ~ age + as.factor(hisp) + as.factor(race)
                        + as.factor(parents_r) + as.factor(educ_r),
                        importance = TRUE, na.action = na.omit, mtry = 5,
                        ntree = 1000, classwt = c(0.31, 0.69),
                        data = nhis)
rfnhis.prob <- predict(object = rf.nhis, newdata = nhis, type = "prob")
apply(rfnhis.prob,2,mean)
#        0         1
#0.2485541 0.7514459    not too close to mean(resp)

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
randomForest_4.6-12

Thanks for any help,
Richard Valliant
Universities of Maryland and Michigan


