[R] Error on random forest variable importance estimates
Pierre Dubath
Pierre.Dubath at unige.ch
Sun Aug 8 15:01:48 CEST 2010
Hello Andy,
Thank you for your quick and helpful reply. I will try to follow your
suggestions.
Also, thank you for the R implementation of random forest. It is very
useful for our work.
Best,
Pierre
Liaw, Andy wrote:
> From: Pierre Dubath
>> Hello,
>>
>> I am using the R randomForest package to classify variable
>> stars. I have
>> a training set of 1755 stars described by (too) many
>> variables. Some of
>> these variables are highly correlated.
>>
>> I believe that I understand how randomForest works and how the
>> variable importances are evaluated (through variable permutation).
>> Here are my questions.
>>
>> 1) Variable importance error? Is there any way to estimate the error
>> on the "MeanDecreaseAccuracy"? In other words, I would like to know
>> how significant the "MeanDecreaseAccuracy" differences are (and
>> display horizontal error bars in the varImpPlot output).
>
> If you really want to do it, one possibility is to do a permutation
> test: permute your response, say, 1000 or 2000 times, run RF on each of
> these permuted responses, and use the importance measures as samples
> from the null distribution.
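A minimal sketch of the permutation test described above, assuming a data frame `train` whose factor response is in a column `class` (both names are hypothetical placeholders for your own data):

```r
library(randomForest)

## Build a null distribution of importance by permuting the response.
n_perm <- 1000
null_imp <- replicate(n_perm, {
  y_perm <- sample(train$class)                 # permute the class labels
  rf <- randomForest(x = train[, names(train) != "class"], y = y_perm,
                     importance = TRUE, ntree = 500)
  importance(rf, type = 1)[, 1]                 # MeanDecreaseAccuracy
})
## null_imp is a (n_vars x n_perm) matrix: compare each variable's observed
## importance against its row to obtain a permutation p-value.
```

With 1000-2000 forests this is expensive; reducing ntree for the null fits is a common compromise.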
>
>> I have noticed that even with a relatively large number of trees,
>> there is variation in the importance values from one run to the next.
>> Could this serve as a measure of the errors/uncertainties?
>
> Yes.
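For example, the run-to-run spread can be summarised by refitting the forest several times (object names `train` and `class` are assumed, as above):

```r
library(randomForest)

## Repeat the fit and collect MeanDecreaseAccuracy from each run.
imps <- replicate(25, {
  rf <- randomForest(class ~ ., data = train,
                     importance = TRUE, ntree = 2000)
  importance(rf, type = 1)[, 1]
})
imp_mean <- rowMeans(imps)
imp_sd   <- apply(imps, 1, sd)   # rough horizontal error bars per variable
```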
>
>> 2) How to deal with variable correlation? So far, I am iterating:
>> selecting the most important variable first, removing all other
>> variables that have a high correlation with it (say higher than 80%),
>> taking the second most important variable left, removing variables
>> with a high correlation with either of the first two variables, and so
>> on... (also using some astronomical insight as to which variables are
>> the most important!)
>>
>> Is there a better way to deal with correlation in randomForest? (I
>> suppose that using many correlated variables should not be a
>> problem for
>> randomForest, but it is for my understanding of the data and
>> for other
>> algorithms).
>
> That depends a lot on what you're trying to do. RF can tolerate
> problematic data, but that doesn't mean it will magically give you good
> answers. Trying to draw conclusions about effects when there are highly
> correlated (and worse, important) variables is a tricky business.
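The greedy filter described in the question could be coded as follows, assuming a numeric predictor data frame `X` and a named importance vector `imp` (both hypothetical):

```r
## Keep variables in decreasing importance order, dropping any whose
## absolute correlation with an already-kept variable exceeds the cutoff.
greedy_cor_filter <- function(X, imp, cutoff = 0.8) {
  ord  <- names(sort(imp, decreasing = TRUE))
  kept <- character(0)
  for (v in ord) {
    if (length(kept) == 0 ||
        all(abs(cor(X[[v]], X[kept])) < cutoff)) {
      kept <- c(kept, v)
    }
  }
  kept
}
```

Note that pairwise correlation misses multivariate redundancy; this is only the procedure the question outlines, not a general fix.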
>
>> 3) How many variables should eventually be used? I have made
>> successive runs, adding one variable at a time from the most to the
>> least important (not-too-correlated) variables. I then plot the error
>> rate (err.rate) as a function of the number of variables used. As this
>> number increases, the error first decreases sharply, but relatively
>> soon it reaches a plateau. I assume that the inflection point can be
>> used to derive the minimum number of variables to use. Is that a
>> sensible approach? Is there any other suggestion? A measure of the
>> error on "err.rate" would also really help here. Is there any idea how
>> to estimate it? From the variation between runs, or with the help of
>> "importanceSD" somehow?
>
> One approach is described in the following paper (in the Proceedings of
> MCS 2004):
> http://www.springerlink.com/content/9n61mquugf9tungl/
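The forward-addition curve from the question can be sketched like this, assuming `train`, its factor response `class`, and a character vector `ranked_vars` of predictors sorted by decreasing importance (all hypothetical names):

```r
library(randomForest)

## Fit a forest on the top-k variables for increasing k and record the
## final out-of-bag (OOB) error rate of each fit.
oob_err <- sapply(seq_along(ranked_vars), function(k) {
  vars <- ranked_vars[1:k]
  rf <- randomForest(x = train[, vars, drop = FALSE], y = train$class,
                     ntree = 1000)
  rf$err.rate[rf$ntree, "OOB"]
})
plot(seq_along(ranked_vars), oob_err, type = "b",
     xlab = "Number of variables", ylab = "OOB error rate")
```

Repeating each fit a few times and plotting the spread of the OOB error would give the uncertainty the question asks about.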
>
> Best,
> Andy
>
>> Thanks very much in advance for any help.
>>
>> Pierre Dubath
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>