[R] random forest significance testing tools
Tom Woolman
twoolman at ontargettek.com
Mon May 11 01:48:04 CEST 2020
Hi everyone. I'm using a random forest in R to successfully classify a
dichotomous DV in a dataset with 29 IVs of type double and
approximately 285,000 records. I fit the model on a 70/30 train/test
split of the original dataset.
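For reference, the split and fit were along these lines (a minimal
sketch only; 'dat' is a placeholder name for the full data frame and the
seed is arbitrary, while Class and train are the names that appear in
the output below):

library(randomForest)

set.seed(123)                               # arbitrary seed
n     <- nrow(dat)                          # 'dat' = full ~285,000-row data frame
idx   <- sample(n, size = floor(0.70 * n))  # 70/30 split
train <- dat[idx, ]
test  <- dat[-idx, ]

rf <- randomForest(Class ~ ., data = train, ntree = 500)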
I'm trying to use the rfUtilities package for rf model selection and
performance evaluation, in order to generate a p-value and other
quantitative performance statistics for use in hypothesis testing,
similar to what I would do with a logistic regression glm model.
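For comparison, this is the kind of per-coefficient p-value I mean from
a logistic regression (a sketch using the same formula; glm and summary
are base R):

glm.fit <- glm(Class ~ ., data = train, family = binomial)
summary(glm.fit)$coefficients   # Wald z statistics and p-values for each IV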
The initial random forest model results and OOB error estimates were
as follows:
randomForest(formula = Class ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 0.04%
Confusion matrix:
       0   1  class.error
0 199004  16 8.039393e-05
1     73 271 2.122093e-01
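The same OOB figures can also be pulled straight from the fitted object:

rf$confusion          # OOB confusion matrix with the class.error column
tail(rf$err.rate, 1)  # OOB error at ntree = 500, overall and per class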
I'm running this model on my laptop (Win10, 8 GB RAM) as I don't have
access to my server during the pandemic. The rfUtilities function call
works (or at least it doesn't give me an error message or crash), but
it's been running for over a day in RStudio on the original rf model
and the training dataset without providing any results.
For anyone who has used the rfUtilities package before: is this simply
too large a data frame for a Win10 laptop to process effectively, or
should I be doing something different? This is my first time using the
rfUtilities package, and I understand that it is relatively new.
The function call for the rfUtilities function rf.significance is as
follows (rf is my original random forest model object from the
randomForest function):
rf.perm <- rf.significance(rf, train[,1:29], nperm=99, ntree=500)
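In case it helps diagnose the runtime, here is the kind of scaled-down
check I could run first (a sketch only; the 10% subsample and nperm = 9
are arbitrary choices for timing, not values suggested by the
rfUtilities documentation):

library(randomForest)
library(rfUtilities)

set.seed(123)
idx.s   <- sample(nrow(train), size = floor(0.10 * nrow(train)))  # 10% subsample
train.s <- train[idx.s, ]
rf.s    <- randomForest(Class ~ ., data = train.s, ntree = 500)

xvars <- setdiff(names(train.s), "Class")   # the 29 IVs, with the DV dropped
system.time(
  rf.sig.s <- rf.significance(rf.s, train.s[, xvars], nperm = 9, ntree = 500)
)
# Scaling the elapsed time up to nperm = 99 and the full training set gives a
# rough estimate of how long the original call should take.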
Thanks in advance.
Tom Woolman
PhD student, Indiana State University