[R] randomForest: predictor importance (for regressions)
Liaw, Andy
andy_liaw at merck.com
Thu May 6 17:06:14 CEST 2010
From: Dimitri Liakhovitski
> >> Andy, I'll explain why I am asking. I probably should have done it in
> >> the beginning:
> >> I am asking not in order to figure out how to do it. I am asking in
> >> order to figure out something that was done around November 01, 2008.
> >> Back then, a piece of code was run where, from the object of
> >> randomForest(.... importance=T...), the importances
> >> ($importance) were extracted (just by referring to
> >> $importance) and the first column was used.
> >> Do you happen to know what they were back then? Standardized or not?
> >
> > The change coincided with the introduction of the importanceSD
> > component, due to the change in how the importance is measured. The
> > "importance" component is just mean(d[i]), and importanceSD is
> > sd(d[i])/sqrt(ntree). The importance() function by default (scale=TRUE)
> > does the normalization, and that's what you should use. Leo found that
> > this normalization will greatly reduce the "bias" due to different
> > numbers of possible splits in different predictors.
>
> Actually, it looks like if one extracts incorrectly (by
> looking just at $importance), then one gets unscaled
> results. I hope it was the same in 2008.
Yes. The NEWS file (what you see when you type rfNews()) shows the following for version 4.3-0:
* The `importance' component of randomForest object has been changed:
The permutation-based measures are not divided by their `standard
errors'. Instead, the `standard errors' are stored in the
`importanceSD' component. One should use the importance() extractor
function rather than something like rf.obj$importance for extracting
the importance measures.
and version 4.3-0 is dated 2004-07-07.
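To make the relationship between the two components concrete, here is a minimal base-R sketch. It assumes d holds the per-tree increases in MSE for a single predictor; the values are simulated purely for illustration (they are not output from a real forest), and the variable names are hypothetical.

```r
# Simulated per-tree increases in MSE for ONE predictor.
# These numbers are made up for illustration, not real randomForest output.
set.seed(1)
ntree <- 500
d <- rnorm(ntree, mean = 1.4, sd = 1.8)

# Unscaled measure: the quantity stored in the `importance` component.
raw_importance <- mean(d)

# "Standard error": the quantity stored in the `importanceSD` component.
importance_sd <- sd(d) / sqrt(ntree)

# Scaled measure: the ratio that importance(..., scale = TRUE) reports.
scaled_importance <- raw_importance / importance_sd

print(c(raw = raw_importance, se = importance_sd, scaled = scaled_importance))
```

So a bare $importance from 4.3-0 onward corresponds to raw_importance here, and the scaled figure is that value divided by its standard error.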
Andy
> I've just run an example randomForest for a case with 6
> predictors (importance = T). My randomForest object is "rftest".
> Below are some results:
>
> Looking at importances the way it was done in November 2008:
> as.data.frame(rftest$importance)[1]
> I am getting:
>
>      %IncMSE
> v1 1.3900833
> v2 1.2219338
> v3 0.6337521
> v4 1.4101760
> v5 1.4474130
> v6 0.7583074
>
> Extracting the way you recommended, and asking for unscaled
> results with importance(rftest, scale=F),
> I am getting exactly the same results as above:
>
>      %IncMSE IncNodePurity
> v1 1.3900833     147.31267
> v2 1.2219338     147.51669
> v3 0.6337521      97.11210
> v4 1.4101760     149.48934
> v5 1.4474130     149.61458
> v6 0.7583074      97.74933
>
> Now, I am extracting scaled importances: importance(rftest, scale=T)
> I am getting:
>
>     %IncMSE IncNodePurity
> v1 16.97155     147.31267
> v2 17.04288     147.51669
> v3 10.19135      97.11210
> v4 18.22732     149.48934
> v5 18.36879     149.61458
> v6 10.46555      97.74933
>
> This is the same as what I get when I do this the way it was done in
> 2008:
> as.data.frame(rftest$importance)[1]/as.data.frame(rftest$importanceSD)
> Resulting in:
>
>     %IncMSE
> v1 16.97155
> v2 17.04288
> v3 10.19135
> v4 18.22732
> v5 18.36879
> v6 10.46555
>
> Dimitri
>
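The comparison in the quoted message can be reproduced along these lines. This is only a sketch: it assumes the randomForest package is installed, uses the built-in mtcars data as a stand-in for the poster's data (which is not available), and the object names rf, raw, scaled, same_raw and same_scaled are hypothetical.

```r
# Sketch only: assumes the randomForest package is installed.
# mtcars is a stand-in data set; it is not the data from the thread.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  rf <- randomForest::randomForest(mpg ~ ., data = mtcars, importance = TRUE)

  # Permutation-based measure (type = 1), unscaled and scaled.
  raw    <- randomForest::importance(rf, type = 1, scale = FALSE)[, 1]
  scaled <- randomForest::importance(rf, type = 1, scale = TRUE)[, 1]

  # The unscaled values should match the first column of $importance.
  same_raw <- isTRUE(all.equal(unname(raw),
                               unname(rf$importance[, "%IncMSE"])))

  # The scaled values should match $importance divided by $importanceSD.
  same_scaled <- isTRUE(all.equal(
    unname(scaled),
    unname(rf$importance[, "%IncMSE"] / rf$importanceSD)
  ))

  print(c(raw_matches = same_raw, scaled_matches = same_scaled))
}
```

Using the importance() extractor with an explicit scale argument, rather than reaching into rf$importance, makes it unambiguous which of the two versions a piece of code is reporting.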