[R] randomForest: predictor importance (for regressions)
Dimitri Liakhovitski
ld7631 at gmail.com
Thu May 6 16:46:27 CEST 2010
>> Andy, I'll explain why I am asking. I probably should have
>> done it in the beginning:
>> I am not asking in order to figure out how to do it. I am
>> asking in order to figure out something that was done around
>> November 1, 2008.
>> Back then, a piece of code was run where from the object of
>> randomForest(.... importance=T...) the importances
>> ($importance) were extracted (just by referring to
>> $importance) and the first column was used.
>> Do you happen to know what they were back then? Standardized or not?
>
> The change coincided with the introduction of the importanceSD component, due to the change in how the importance is measured. The "importance" component is just mean(d[i]), and importanceSD is sd(d[i])/sqrt(ntree). The importance() function by default (scale=TRUE) does the normalization, and that's what you should use. Leo found that this normalization greatly reduces the "bias" due to the different number of possible splits in different predictors.
Actually, it looks like if one extracts the importances incorrectly (by
looking just at $importance), one gets unscaled results. I hope it was
the same in 2008.
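To make Andy's description concrete for myself, here is a minimal sketch in base R of the arithmetic he describes. The per-tree differences d are simulated (the mean and sd passed to rnorm are made up purely for illustration), so this only shows how the three quantities relate. It is not what randomForest does internally.

```r
# Illustration only: simulated per-tree differences, not randomForest internals.
set.seed(1)
ntree <- 500

# d[i]: hypothetical per-tree increase in out-of-bag MSE after permuting
# one predictor (values are simulated, not taken from any real forest)
d <- rnorm(ntree, mean = 1.4, sd = 1.8)

raw_importance <- mean(d)              # what $importance holds, per Andy
importance_se  <- sd(d) / sqrt(ntree)  # what $importanceSD holds, per Andy

# The normalization that importance(..., scale=TRUE) applies, per Andy
scaled_importance <- raw_importance / importance_se

print(c(raw = raw_importance, se = importance_se, scaled = scaled_importance))
```

Since the scaled value is the raw mean divided by its standard error, two predictors with similar raw importances can end up with quite different scaled importances if their per-tree variability differs.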
I've just run an example randomForest for a case with 6 predictors
(importance = T). My randomForest object is "rftest".
Below are some results:
Looking at importances the way it was done in November 2008:
as.data.frame(rftest$importance)[1]
I am getting:
     %IncMSE
v1 1.3900833
v2 1.2219338
v3 0.6337521
v4 1.4101760
v5 1.4474130
v6 0.7583074
Extracting the way you recommended, and asking for unscaled
results: importance(rftest, scale=F)
I am getting exactly the same results as above:
     %IncMSE IncNodePurity
v1 1.3900833     147.31267
v2 1.2219338     147.51669
v3 0.6337521      97.11210
v4 1.4101760     149.48934
v5 1.4474130     149.61458
v6 0.7583074      97.74933
Now, I am extracting scaled importances: importance(rftest, scale=T)
I am getting:
    %IncMSE IncNodePurity
v1 16.97155     147.31267
v2 17.04288     147.51669
v3 10.19135      97.11210
v4 18.22732     149.48934
v5 18.36879     149.61458
v6 10.46555      97.74933
This is the same as what I get when I do this the way it was done in
2008: as.data.frame(rftest$importance)[1]/as.data.frame(rftest$importanceSD)
Resulting in:
    %IncMSE
v1 16.97155
v2 17.04288
v3 10.19135
v4 18.22732
v5 18.36879
v6 10.46555
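For anyone who wants to repeat the comparison, here is a rough sketch of the same checks on a built-in data set. The mtcars data, the seed and ntree = 500 are my own stand-ins (the data behind rftest are not shown here), so the numbers will differ from the ones above. Only the relationships between the three ways of extracting should carry over.

```r
library(randomForest)

set.seed(42)
# mtcars is a stand-in for the real data, which is not part of this post
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)

raw      <- rf$importance[, "%IncMSE"]                 # direct extraction
unscaled <- importance(rf, scale = FALSE)[, "%IncMSE"]
scaled   <- importance(rf, scale = TRUE)[, "%IncMSE"]

# Check 1: reading $importance directly matches importance(scale = FALSE)
print(all.equal(raw, unscaled))

# Check 2: dividing by $importanceSD matches importance(scale = TRUE)
print(all.equal(raw / rf$importanceSD, scaled))
```

If the relationships above hold for the installed version of the package, both comparisons report TRUE.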
Dimitri