[R] Random Forest AUC

Fri Oct 22 16:38:48 CEST 2010

Ravishankar,

> I used Random Forest with a couple of data sets I had to predict for binary
> response. In all the cases, the AUC of the training set is coming to be 1.
> Is this always the case with random forests? Can someone please clarify
> this?

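Yes, this is pretty typical for this model. The individual trees are
grown deep enough to fit the training rows almost perfectly, so
re-predicting the training set gives an AUC at or near 1.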

> I have given a simple example, first using logistic regression and then
> using random forests to explain the problem. AUC of the random forest is
> coming out to be 1.

Logistic regression isn't as flexible as RF and some other methods, so
its training set area under the ROC curve is likely to be less than one,
but still higher than the true value (since you are re-predicting the
same data that were used to fit the model).

For your example:

> performance(prediction(train.predict,iris$Species),"auc")@y.values[[1]]
[1] 0.9972

but using simple 10-fold CV:

> library(caret)
> ctrl <- trainControl(method = "cv",
+                      classProbs = TRUE,
+                      summaryFunction = twoClassSummary)
>
> set.seed(1)
> cvEstimate <- train(Species ~ ., data = iris,
+                     method = "glm",
+                     metric = "ROC",
+                     trControl = ctrl)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: algorithm did not converge
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: algorithm did not converge
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
> cvEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "glm",
    metric = "ROC", trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROC   Sens SD  Spec SD  ROC SD
  0.96  0.98  0.86  0.0843   0.0632   0.126

and for random forest:

> set.seed(1)
> rfEstimate <- train(Species ~ .,
+                     data = iris,
+                     method = "rf",
+                     metric = "ROC",
+                     tuneGrid = data.frame(.mtry = 2),
+                     trControl = ctrl)
Fitting: mtry=2
Aggregating results
Selecting tuning parameters
Fitting model on full training set
> rfEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "rf",
    metric = "ROC", tuneGrid = data.frame(.mtry = 2), trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROC    Sens SD  Spec SD  ROC SD
  0.94  0.92  0.898  0.0966   0.14     0.00632

Tuning parameter 'mtry' was held constant at a value of 2
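
The gap between re-predicting the training set and a held-out estimate can
also be seen without caret, using the out-of-bag (OOB) predictions that
randomForest keeps for each row. The following is only a rough sketch. It
assumes the two-class data are the versicolor and virginica rows of iris
(the original example does not say which 100 rows were used), and auc() is
a hypothetical helper that computes the area from ranks rather than with
ROCR.

```r
library(randomForest)

# Assumption: a two-class subset of iris; the original post does not say
# which 100 rows were used.
dat <- droplevels(subset(iris, Species != "setosa"))

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula.
# rank() assigns average ranks to ties, which is what the formula needs.
auc <- function(score, is_positive) {
  r  <- rank(score)
  n1 <- sum(is_positive)
  n0 <- sum(!is_positive)
  (sum(r[is_positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
rf <- randomForest(Species ~ ., data = dat, mtry = 2)

is_virginica <- dat$Species == "virginica"

# Re-predicting the training rows: every tree votes, including the trees
# whose bootstrap sample contained the row.
resub_prob <- predict(rf, newdata = dat, type = "prob")[, "virginica"]

# Without newdata, predict() returns the out-of-bag predictions: each row
# is scored only by trees that did not see it during fitting.
oob_prob <- predict(rf, type = "prob")[, "virginica"]

auc_resub <- auc(resub_prob, is_virginica)
auc_oob   <- auc(oob_prob, is_virginica)
print(c(resubstitution = auc_resub, out_of_bag = auc_oob))
```

The OOB number plays the same role as the cross-validated ROC value from
train(): it is computed on rows the contributing trees never saw, so it
should be the more honest of the two.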

-- 

Max