[R] randomForest: predictor importance (for regressions)

Thu May 6 16:46:27 CEST 2010

>> Andy, I'll explain why I am asking. I probably should have
>> done it in the beginning:
>> I am asking not in order to figure out how to do it. I am
>> asking in order to figure out something that was done around
>> November 01, 2008.
>> Back then, a piece of code was run where from the object of
>> randomForest(.... importance=T...) the importances
>> ($importance) were extracted (just by referring to
>> $importance) and the first column was used.
>> Do you happen to know what they were back then? Standardized or not?
>
> The change coincided with the introduction of the importanceSD
> component, due to the change in how the importance is measured.
> The "importance" component is just mean(d[i]), and importanceSD is
> sd(d[i])/sqrt(ntree).  The importance() function by default
> (scale=TRUE) does the normalization, and that's what you should use.
> Leo found that this normalization will greatly reduce the "bias" due
> to differing numbers of possible splits in different predictors.
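
The relationship described in the quoted reply can be sketched in base R. This is only an illustration of the arithmetic, assuming d holds one out-of-bag error increase per tree for a single predictor. The values are simulated rather than taken from a fitted forest, and all variable names are made up for the example.

```r
# Hypothetical per-tree increases in out-of-bag MSE for one predictor
# (simulated numbers, NOT extracted from a real randomForest object).
set.seed(1)
ntree <- 500
d <- rnorm(ntree, mean = 1.4, sd = 1.8)

# Unscaled importance: the mean increase over trees,
# i.e. what the quoted reply says $importance holds.
raw_importance <- mean(d)

# Standard error of that mean, i.e. what the quoted reply
# says $importanceSD holds.
importance_se <- sd(d) / sqrt(ntree)

# Scaled importance: the ratio of the two, which is what
# importance(..., scale = TRUE) is described as returning.
scaled_importance <- raw_importance / importance_se

print(c(raw = raw_importance, se = importance_se, scaled = scaled_importance))
```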

Actually, it looks like extracting the values directly from
$importance (instead of going through importance()) gives the
unscaled results. I hope it behaved the same way in 2008.

I've just run an example randomForest for a case with 6 predictors
(importance = T). My randomForest object is "rftest".
Below are some results:

Looking at importances the way it was done in November 2008:
as.data.frame(rftest$importance)[1]
I am getting:

     %IncMSE
v1 1.3900833
v2 1.2219338
v3 0.6337521
v4 1.4101760
v5 1.4474130
v6 0.7583074

Extracting the way you recommended, but asking for unscaled
results:  importance(rftest, scale=F)
I am getting exactly the same %IncMSE values as above:

     %IncMSE IncNodePurity
v1 1.3900833     147.31267
v2 1.2219338     147.51669
v3 0.6337521      97.11210
v4 1.4101760     149.48934
v5 1.4474130     149.61458
v6 0.7583074      97.74933

Now, I am extracting scaled importances:  importance(rftest, scale=T)
I am getting:

    %IncMSE IncNodePurity
v1 16.97155     147.31267
v2 17.04288     147.51669
v3 10.19135      97.11210
v4 18.22732     149.48934
v5 18.36879     149.61458
v6 10.46555      97.74933

This is the same as what I get when I pull the components out of the
object directly, as was done in 2008, and divide the importances by
importanceSD:  as.data.frame(rftest$importance)[1]/as.data.frame(rftest$importanceSD)
Resulting in:

    %IncMSE
v1 16.97155
v2 17.04288
v3 10.19135
v4 18.22732
v5 18.36879
v6 10.46555
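
The same comparison can be repeated on any machine with a sketch like the one below. It uses the built-in mtcars data instead of the data behind the numbers above, and the formula, the seed, and the object names are assumptions made for the example. It relies on importanceSD being a plain named vector for regression forests.

```r
library(randomForest)

# Assumed example: a regression forest on the built-in mtcars data.
set.seed(42)
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# Unscaled %IncMSE, taken straight from the object ...
raw_direct <- rf$importance[, "%IncMSE"]
# ... and through the accessor with scaling switched off.
raw_fun <- importance(rf, type = 1, scale = FALSE)[, 1]

# Scaled %IncMSE, computed by hand as importance / importanceSD ...
scaled_manual <- raw_direct / rf$importanceSD
# ... and through the accessor with its default scaling.
scaled_fun <- importance(rf, type = 1, scale = TRUE)[, 1]

# Both comparisons should report TRUE if the two routes agree.
print(all.equal(raw_direct, raw_fun))
print(all.equal(scaled_manual, scaled_fun))
```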

Dimitri