[R] Error on random forest variable importance estimates

Pierre Dubath Pierre.Dubath at unige.ch
Fri Aug 6 16:17:12 CEST 2010


Hello,

I am using the R randomForest package to classify variable stars. I have 
a training set of 1755 stars described by (too) many variables. Some of 
these variables are highly correlated.

I believe I understand how randomForest works and how variable 
importance is evaluated (through variable permutations). Here are my 
questions.

1) Variable importance error? Is there any way to estimate the error on 
the "MeanDecreaseAccuracy"? In other words, I would like to know how 
significant the differences in "MeanDecreaseAccuracy" between variables 
are (and display horizontal error bars in the varImpPlot output).

I have noticed that, even with a relatively large number of trees, the 
importance values vary from one run to the next. Could this variation 
serve as a measure of the errors/uncertainties?
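
To make question 1 concrete, here is a minimal sketch of the run-to-run 
comparison I have in mind. It is only an illustration: iris stands in 
for my star data, and n_runs, the seeds and ntree are arbitrary choices. 
Whether the resulting spread is a valid error estimate is exactly what I 
am asking.

```r
library(randomForest)

# Stand-in data (assumption): iris replaces the 1755-star training set
x <- iris[, 1:4]
y <- iris$Species

n_runs <- 20  # arbitrary number of repeated forests

# One column per run, one row per variable (permutation importance, type = 1)
imp <- sapply(seq_len(n_runs), function(i) {
  set.seed(i)
  rf <- randomForest(x, y, ntree = 500, importance = TRUE)
  importance(rf, type = 1)[, 1]  # MeanDecreaseAccuracy
})

imp_mean <- rowMeans(imp)
imp_sd   <- apply(imp, 1, sd)  # run-to-run spread per variable

# Dot chart with horizontal bars of +/- one standard deviation
ord <- order(imp_mean)
dotchart(imp_mean[ord],
         xlim = range(c(imp_mean - imp_sd, imp_mean + imp_sd)),
         xlab = "MeanDecreaseAccuracy")
segments(imp_mean[ord] - imp_sd[ord], seq_along(ord),
         imp_mean[ord] + imp_sd[ord], seq_along(ord))
```

The bars here only reflect the randomness of the forest on a fixed 
training set, not the sampling variability of the training set itself.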

2) How to deal with variable correlation? So far, I am iterating: 
selecting the most important variable first, removing all other 
variables that are highly correlated with it (say higher than 80%), 
taking the second most important variable left, removing the variables 
that are highly correlated with either of the first two, and so on (also 
using some astronomical insight as to which variables are the most 
important!).

Is there a better way to deal with correlation in randomForest? (I 
suppose that using many correlated variables should not be a problem for 
randomForest, but it is for my understanding of the data and for other 
algorithms).
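
For reference, the iteration described in question 2 can be sketched 
roughly as follows. This assumes that the importance ranking is computed 
once and not updated between steps, and that "correlation" means the 
absolute Pearson correlation. The function name greedy_select and the 
importance values in the example are made up for illustration; iris 
again stands in for my data.

```r
# Walk down the importance ranking and keep a variable only if its
# absolute correlation with every variable already kept is at most
# 'threshold'.
greedy_select <- function(x, imp, threshold = 0.8) {
  cors <- abs(cor(x))
  candidates <- names(sort(imp, decreasing = TRUE))
  kept <- character(0)
  for (v in candidates) {
    # all() on an empty comparison is TRUE, so the top variable is always kept
    if (all(cors[v, kept] <= threshold)) {
      kept <- c(kept, v)
    }
  }
  kept
}

x <- iris[, 1:4]

# Hypothetical importance values, for illustration only
imp <- c(Sepal.Length = 10, Sepal.Width = 5,
         Petal.Length = 30, Petal.Width = 28)

kept <- greedy_select(x, imp, threshold = 0.8)
print(kept)
```

With these made-up values, only Petal.Length and Sepal.Width survive, 
because the other two variables correlate above 0.8 with Petal.Length.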

3) How many variables should eventually be used? I have made successive 
runs, adding one variable at a time, from the most to the least 
important of the (not-too-correlated) variables. I then plot the error 
rate (err.rate) as a function of the number of variables used. As this 
number increases, the error first decreases sharply, but relatively soon 
it reaches a plateau. I assume that the point of inflexion can be used 
to derive the minimum number of variables to use. Is that a sensible 
approach? Are there any other suggestions? A measure of the error on 
"err.rate" would also really help here. Any idea how to estimate it? 
From the variation between runs, or with the help of "importanceSD" 
somehow?
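
The procedure in question 3 looks roughly like the sketch below, with 
several forests per step so that the run-to-run spread of the final OOB 
error can be drawn as vertical bars. Again this is only an illustration 
under assumptions: iris replaces my data, the ranking comes from a 
single initial forest (in my case it would be the not-too-correlated 
list), and n_runs, ntree and the seeds are arbitrary.

```r
library(randomForest)

# Stand-in data (assumption): iris replaces the star training set
x <- iris[, 1:4]
y <- iris$Species

# Rank the variables once, from most to least important
set.seed(1)
rf0 <- randomForest(x, y, ntree = 500, importance = TRUE)
ranked <- names(sort(importance(rf0, type = 1)[, 1], decreasing = TRUE))

n_runs <- 10  # arbitrary number of repeated forests per step

# oob[r, k]: final OOB error of run r using the k most important variables
oob <- sapply(seq_along(ranked), function(k) {
  sapply(seq_len(n_runs), function(r) {
    set.seed(100 * k + r)
    rf <- randomForest(x[, ranked[1:k], drop = FALSE], y, ntree = 300)
    rf$err.rate[rf$ntree, "OOB"]
  })
})

oob_mean <- colMeans(oob)
oob_sd   <- apply(oob, 2, sd)  # run-to-run spread at each step

# Error rate against number of variables, with +/- one sd bars
k <- seq_along(ranked)
plot(k, oob_mean, type = "b",
     ylim = range(c(oob_mean - oob_sd, oob_mean + oob_sd)),
     xlab = "Number of variables used", ylab = "OOB error rate")
segments(k, oob_mean - oob_sd, k, oob_mean + oob_sd)
```

With bars like these, one could at least check whether the differences 
along the plateau are larger than the run-to-run scatter.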

Thanks very much in advance for any help.

Pierre Dubath


