[R] Random Forest AUC
Max Kuhn
mxkuhn at gmail.com
Fri Oct 22 16:38:48 CEST 2010
Ravishankar,
> I used Random Forest with a couple of data sets I had to predict for binary
> response. In all the cases, the AUC of the training set is coming to be 1.
> Is this always the case with random forests? Can someone please clarify
> this?
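This is pretty typical for this model: a random forest can fit the
training set almost perfectly, so re-predicting those same samples says
little about how it will do on new data.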
> I have given a simple example, first using logistic regression and then
> using random forests to explain the problem. AUC of the random forest is
> coming out to be 1.
Logistic regression isn't as flexible as RF and some other methods, so
the area under its training-set ROC curve is likely to be less than one,
but still much higher than it really is (since you are re-predicting the
same data).
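To make the re-prediction point concrete, here is a minimal sketch in base R that compares the resubstitution AUC of a logistic regression with a pooled 10-fold cross-validated AUC. The two-class subset of iris (versicolor vs. virginica) and the helper name auc() are assumptions made for illustration, not the original poster's data or code. The cross-validated predictions are pooled before computing one AUC, which is not the same as the per-fold average that caret reports.

```r
# Assumption: a two-class subset of iris stands in for the original data.
dat <- droplevels(subset(iris, Species != "setosa"))
dat$y <- as.integer(dat$Species == "virginica")
dat$Species <- NULL

# Hypothetical helper: AUC from the Mann-Whitney rank-sum statistic.
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Resubstitution: fit and predict on the same rows.
fit <- suppressWarnings(glm(y ~ ., data = dat, family = binomial))
auc_resub <- auc(predict(fit, type = "response"), dat$y)

# 10-fold CV: every row is predicted by a model that never saw it.
set.seed(1)
fold <- sample(rep(1:10, length.out = nrow(dat)))
cv_pred <- numeric(nrow(dat))
for (k in 1:10) {
  held <- fold == k
  fit_k <- suppressWarnings(
    glm(y ~ ., data = dat[!held, ], family = binomial)
  )
  cv_pred[held] <- predict(fit_k, newdata = dat[held, ], type = "response")
}
auc_cv <- auc(cv_pred, dat$y)

print(c(resubstitution = auc_resub, cross_validated = auc_cv))
```

The resubstitution number is the optimistic one; the cross-validated number is the one to report.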
For your example:
> performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
[1] 0.9972
but using simple 10-fold CV:
> library(caret)
> ctrl <- trainControl(method = "cv",
+ classProbs = TRUE,
+ summaryFunction = twoClassSummary)
>
> set.seed(1)
> cvEstimate <- train(Species ~ ., data = iris,
+ method = "glm",
+ metric = "ROC",
+ trControl = ctrl)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: algorithm did not converge
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: algorithm did not converge
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
> cvEstimate
Call:
train.formula(form = Species ~ ., data = iris, method = "glm",
metric = "ROC", trControl = ctrl)
100 samples
4 predictors
Pre-processing:
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results
  Sens  Spec  ROC   Sens SD  Spec SD  ROC SD
  0.96  0.98  0.86  0.0843   0.0632   0.126
and for random forest:
> set.seed(1)
> rfEstimate <- train(Species ~ .,
+ data = iris,
+ method = "rf",
+ metric = "ROC",
+ tuneGrid = data.frame(.mtry = 2),
+ trControl = ctrl)
Fitting: mtry=2
Aggregating results
Selecting tuning parameters
Fitting model on full training set
> rfEstimate
Call:
train.formula(form = Species ~ ., data = iris, method = "rf",
metric = "ROC", tuneGrid = data.frame(.mtry = 2), trControl = ctrl)
100 samples
4 predictors
Pre-processing:
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results
  Sens  Spec  ROC    Sens SD  Spec SD  ROC SD
  0.94  0.92  0.898  0.0966   0.14     0.00632
Tuning parameter 'mtry' was held constant at a value of 2
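If you don't want to run a full resampling loop, the out-of-bag (OOB) predictions from the forest itself give a quick honest estimate. The sketch below assumes the randomForest package and, again, a two-class subset of iris as a stand-in for the original data; auc() is a hypothetical helper, not part of any package. Calling predict() on a randomForest object without newdata returns the OOB predictions, whereas passing the training rows as newdata re-predicts samples the trees have already seen.

```r
library(randomForest)

# Assumption: a two-class subset of iris stands in for the original data.
dat <- droplevels(subset(iris, Species != "setosa"))
y <- as.integer(dat$Species == "virginica")

# Hypothetical helper: AUC from the Mann-Whitney rank-sum statistic.
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
rf <- randomForest(Species ~ ., data = dat, mtry = 2)

# Re-predicting the training rows: each row was in-bag for most trees.
p_train <- predict(rf, newdata = dat, type = "prob")[, "virginica"]

# No newdata: each row is scored only by trees that did not see it.
p_oob <- predict(rf, type = "prob")[, "virginica"]

auc_train <- auc(p_train, y)
auc_oob   <- auc(p_oob, y)
print(c(training = auc_train, out_of_bag = auc_oob))
```

The training figure is the one that tends to sit at or near 1; the OOB figure should be closer to what cross-validation reports.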
--
Max