[R] randomForest: predictor importance (for regressions)

Dimitri Liakhovitski ld7631 at gmail.com
Thu May 6 16:04:11 CEST 2010


Thank you very much, Andy.
I did turn off HTML - hope it'll solve the problem!

> Andy, but it is the FIRST column in $importance (not the SECOND) that is
> labeled "%IncMSE". The second column is labeled "IncNodePurity". So, I
> am confused - which one is the mean decrease in accuracy?
> Or, maybe I should ask again: In the case of regression trees, which of
> the two columns in $importance contains the predictor importances
> calculated by randomly permuting values and looking at how much worse
> the prediction has become?
> I assume it's the first column (labeled "%IncMSE"). Is this correct?
>
> [AL]: Note I said "reduction in node impurity", which is another way of
> saying "increase in node purity" 8-).  I should think from the help page
> for importance() it should be clear which is which.  When you permute
> the value of a variable in OOB data and make prediction, the expectation
> is that the MSE will increase, especially if the variable has some
> importance, thus the label "%IncMSE".  Why do you need to assume?

Great, thanks for confirming!


> [AL]: As I said, you are recommended to use importance() to extract
> variable importance.  The recommendation is meant to avoid confusion like
> yours.  If you want to know what the components in the object give you,
> compared to what the extractor function returns, you can look inside the
> extractor function to find out for yourself.  Really, I'm not trying to
> be difficult, but there are very good reasons for not accessing the
> components directly when extractor functions exist.  If the underlying
> components are somehow changed in the future, only the extractor
> functions are guaranteed to give you the "right thing".  I added the
> extractor function for importance measures precisely because the way
> they are computed changed.
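
For reference, the extractor usage recommended above looks roughly like
the sketch below. The mtcars data, the seed, and ntree = 200 are only
illustrative choices, not anything from the original analysis.

```r
library(randomForest)

set.seed(1)
# Regression forest on a built-in data set; importance = TRUE is needed
# for the permutation-based measure to be computed at all.
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 200)

# type = 1: permutation-based measure (column labeled "%IncMSE")
imp_perm <- importance(rf, type = 1)

# type = 2: total decrease in node impurity (column labeled "IncNodePurity")
imp_purity <- importance(rf, type = 2)

colnames(imp_perm)
colnames(imp_purity)
```

Asking for a measure by type, rather than by column position in
$importance, keeps the code independent of how the columns are ordered.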

Andy, I'll explain why I am asking; I probably should have done so at
the beginning.
I am not asking in order to figure out how to do it. I am asking in
order to figure out something that was done around November 1, 2008.
Back then, a piece of code was run in which the importances were
extracted from the object returned by
randomForest(.... importance=T...) simply by referring to $importance,
and the first column was used.
Do you happen to know what those values were back then? Standardized
or not?
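
One way to check this empirically is to compare the raw first column
with what importance() returns with and without scaling. This is only a
sketch: the data set, seed, and ntree are illustrative, and the result
reflects whichever package version is installed, so to answer the 2008
question it would have to be run against the version in use back then
(packageVersion("randomForest") reports the installed one).

```r
library(randomForest)

set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 200)

# First column of the raw component, as the old code used it
raw <- rf$importance[, 1]

# Permutation-based measure from the extractor, unscaled and scaled
unscaled <- importance(rf, type = 1, scale = FALSE)[, 1]
scaled   <- importance(rf, type = 1, scale = TRUE)[, 1]

# Whichever comparison is TRUE tells you what $importance stores
isTRUE(all.equal(raw, unscaled))
isTRUE(all.equal(raw, scaled))
```

If the first comparison is TRUE, the raw component holds the
unstandardized values, and the standardization happens only inside
importance().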

Thank you!
Dimitri


