[R] random forest significance testing tools
Tom Woolman
twoolman at ontargettek.com
Mon May 11 01:48:04 CEST 2020
Hi everyone. I'm using a random forest in R to successfully classify a
dichotomous DV in a dataset with 29 IVs of type double and
approximately 285,000 records. I fit the model on a 70/30 train/test
split of the original dataset.
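For reference, the split and fit were along these lines (a minimal
sketch only; 'dat' is a placeholder name for the full data frame and the
seed is arbitrary, while Class and train are the names that appear in
the output below):

library(randomForest)

set.seed(123)                               # arbitrary seed
n     <- nrow(dat)                          # 'dat' = full ~285,000-row data frame
idx   <- sample(n, size = floor(0.70 * n))  # 70/30 split
train <- dat[idx, ]
test  <- dat[-idx, ]

rf <- randomForest(Class ~ ., data = train, ntree = 500)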
I'm trying to use the rfUtilities package for rf model selection and
performance evaluation, in order to generate a p-value and other
quantitative performance statistics for use in hypothesis testing,
similar to what I would do with a logistic regression glm model.
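For comparison, this is the kind of per-coefficient p-value I mean from
a logistic regression (a sketch using the same formula; glm and summary
are base R):

glm.fit <- glm(Class ~ ., data = train, family = binomial)
summary(glm.fit)$coefficients   # Wald z statistics and p-values for each IV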
The initial random forest model results and OOB error estimates were
as follows:
randomForest(formula = Class ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 0.04%
Confusion matrix:
       0   1  class.error
0 199004  16 8.039393e-05
1     73 271 2.122093e-01
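The same OOB figures can also be pulled straight from the fitted object:

rf$confusion          # OOB confusion matrix with the class.error column
tail(rf$err.rate, 1)  # OOB error at ntree = 500, overall and per class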
I'm running this model on my laptop (Win10, 8 GB RAM) as I don't have
access to my server during the pandemic. The rfUtilities function call
works (or at least it doesn't give me an error message or crash), but
it's been running for over a day in RStudio on the original rf model
and the training dataset without providing any results.
For anyone who has used the rfUtilities package before: is this simply
too large a data frame for a Win10 laptop to process effectively, or
should I be doing something different? This is my first time using the
rfUtilities package, and I understand that it is relatively new.
The function call for the rfUtilities function rf.significance is as
follows (rf is my original random forest model object from the
randomForest function):
rf.perm <- rf.significance(rf, train[,1:29], nperm=99, ntree=500)
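In case it helps diagnose the runtime, here is the kind of scaled-down
check I could run first (a sketch only; the 10% subsample and nperm = 9
are arbitrary choices for timing, not values suggested by the
rfUtilities documentation):

library(randomForest)
library(rfUtilities)

set.seed(123)
idx.s   <- sample(nrow(train), size = floor(0.10 * nrow(train)))  # 10% subsample
train.s <- train[idx.s, ]
rf.s    <- randomForest(Class ~ ., data = train.s, ntree = 500)

xvars <- setdiff(names(train.s), "Class")   # the 29 IVs, with the DV dropped
system.time(
  rf.sig.s <- rf.significance(rf.s, train.s[, xvars], nperm = 9, ntree = 500)
)
# Scaling the elapsed time up to nperm = 99 and the full training set gives a
# rough estimate of how long the original call should take.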
Thanks in advance.
Tom Woolman
PhD student, Indiana State University