[R] Error on random forest variable importance estimates
Pierre Dubath
Pierre.Dubath at unige.ch
Sun Aug 8 15:01:48 CEST 2010
Hello Andy,
Thank you for your quick and helpful reply. I will try to follow your
suggestions.
Also, thank you for the R implementation of random forest. It is very
useful for our work.
Best,
Pierre
Liaw, Andy wrote:
> From: Pierre Dubath
>> Hello,
>>
>> I am using the R randomForest package to classify variable
>> stars. I have
>> a training set of 1755 stars described by (too) many
>> variables. Some of
>> these variables are highly correlated.
>>
>> I believe that I understand how randomForest works and how the
>> variable importances are evaluated (through variable permutation).
>> Here are my questions.
>>
>> 1) Variable importance error? Is there any way to estimate the error
>> on the "MeanDecreaseAccuracy"? In other words, I would like to know
>> how significant the "MeanDecreaseAccuracy" differences are (and
>> display horizontal error bars in the varImpPlot output).
>
> If you really want to do it, one possibility is to do a permutation
> test: permute your response, say, 1000 or 2000 times, run RF on each of
> these permuted responses, and use the importance measures as samples
> from the null distribution.
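A minimal sketch of the permutation test described above, assuming a data frame `train` whose factor response is in a column `class` (both names are hypothetical placeholders for your own data):

```r
library(randomForest)

## Build a null distribution of importance by permuting the response.
n_perm <- 1000
null_imp <- replicate(n_perm, {
  y_perm <- sample(train$class)                 # permute the class labels
  rf <- randomForest(x = train[, names(train) != "class"], y = y_perm,
                     importance = TRUE, ntree = 500)
  importance(rf, type = 1)[, 1]                 # MeanDecreaseAccuracy
})
## null_imp is a (n_vars x n_perm) matrix: compare each variable's observed
## importance against its row to obtain a permutation p-value.
```

With 1000-2000 forests this is expensive; reducing ntree for the null fits is a common compromise.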
>
>> I have noticed that even with a relatively large number of trees,
>> there is variation in the importance values from one run to the next.
>> Could this serve as a measure of the errors/uncertainties?
>
> Yes.
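For example, the run-to-run spread can be summarised by refitting the forest several times (object names `train` and `class` are assumed, as above):

```r
library(randomForest)

## Repeat the fit and collect MeanDecreaseAccuracy from each run.
imps <- replicate(25, {
  rf <- randomForest(class ~ ., data = train,
                     importance = TRUE, ntree = 2000)
  importance(rf, type = 1)[, 1]
})
imp_mean <- rowMeans(imps)
imp_sd   <- apply(imps, 1, sd)   # rough horizontal error bars per variable
```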
>
>> 2) How to deal with variable correlation? So far, I am iterating:
>> selecting the most important variable first, removing all other
>> variables that have a high correlation with it (say higher than 80%),
>> taking the second most important variable left, removing variables
>> with a high correlation with either of the first two variables, and so
>> on... (also using some astronomical insight as to which variables are
>> the most important!)
>>
>> Is there a better way to deal with correlation in randomForest? (I
>> suppose that using many correlated variables should not be a
>> problem for
>> randomForest, but it is for my understanding of the data and
>> for other
>> algorithms).
>
> That depends a lot on what you're trying to do. RF can tolerate
> problematic data, but that doesn't mean it will magically give you good
> answers. Trying to draw conclusions about effects when there are highly
> correlated (and worse, important) variables is a tricky business.
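The greedy filter described in the question could be coded as follows, assuming a numeric predictor data frame `X` and a named importance vector `imp` (both hypothetical):

```r
## Keep variables in decreasing importance order, dropping any whose
## absolute correlation with an already-kept variable exceeds the cutoff.
greedy_cor_filter <- function(X, imp, cutoff = 0.8) {
  ord  <- names(sort(imp, decreasing = TRUE))
  kept <- character(0)
  for (v in ord) {
    if (length(kept) == 0 ||
        all(abs(cor(X[[v]], X[kept])) < cutoff)) {
      kept <- c(kept, v)
    }
  }
  kept
}
```

Note that pairwise correlation misses multivariate redundancy; this is only the procedure the question outlines, not a general fix.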
>
>> 3) How many variables should eventually be used? I have made
>> successive runs, adding one variable at a time from the most to the
>> least important (not-too-correlated) variables. I then plot the error
>> rate (err.rate) as a function of the number of variables used. As this
>> number increases, the error first decreases sharply, but relatively
>> soon it reaches a plateau. I assume that the inflection point can be
>> used to derive the minimum number of variables to use. Is that a
>> sensible approach? Is there any other suggestion? A measure of the
>> error on "err.rate" would also really help here. Is there any idea how
>> to estimate it? From the variation between runs, or with the help of
>> "importanceSD" somehow?
>
> One approach is described in the following paper (in the Proceedings of
> MCS 2004):
> http://www.springerlink.com/content/9n61mquugf9tungl/
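The forward-addition curve from the question can be sketched like this, assuming `train`, its factor response `class`, and a character vector `ranked_vars` of predictors sorted by decreasing importance (all hypothetical names):

```r
library(randomForest)

## Fit a forest on the top-k variables for increasing k and record the
## final out-of-bag (OOB) error rate of each fit.
oob_err <- sapply(seq_along(ranked_vars), function(k) {
  vars <- ranked_vars[1:k]
  rf <- randomForest(x = train[, vars, drop = FALSE], y = train$class,
                     ntree = 1000)
  rf$err.rate[rf$ntree, "OOB"]
})
plot(seq_along(ranked_vars), oob_err, type = "b",
     xlab = "Number of variables", ylab = "OOB error rate")
```

Repeating each fit a few times and plotting the spread of the OOB error would give the uncertainty the question asks about.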
>
> Best,
> Andy
>
>> Thanks very much in advance for any help.
>>
>> Pierre Dubath
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>