[R] some question regarding random forest
Rajarshi Guha
rxg218 at psu.edu
Tue Mar 2 01:00:24 CET 2004
Hi,
I had two questions regarding random forests for regression.
1) I have read the original paper by Breiman as well as a paper
discussing an application of random forests, and it appears that one
of the nice features of this technique is good predictive ability.
However, I have some data with which I have generated a linear model
using lm(). I can get an RMS error of 0.43 and an R^2 of 0.62.
However, when I make a plot of predicted versus observed using the
randomForest() function, the plot is much more scattered (RMS error of
0.55 and R^2 of 0.33) than a similar plot using the linear model.
(When a test set is supplied to the models, the R^2 values are close.)
My question is: should I expect the randomForest to give me similar or
better results than a simple linear model? In the above case I was
expecting that for the training data (i.e. the data with which the random
forest was built) I would get less scatter in the plot and a lower RMSE.
(I realize that too much stock shouldn't be placed in R^2.)
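
A minimal sketch of the comparison described above, in case it helps to pin
down what is being computed. Everything here is an assumption made for
illustration: the built-in mtcars data stands in for the actual data, rmse()
and r2() are made-up helper names, and R^2 is taken as the squared correlation
between observed and predicted values. Since it is not stated which
predictions the plot was based on, the sketch reports both the predictions
for the training rows and the out-of-bag predictions stored in the fitted
object. The random forest part only runs if the randomForest package is
installed.

```r
# Hypothetical helpers, not part of any package
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2   <- function(obs, pred) cor(obs, pred)^2  # squared correlation (an assumption)

# mtcars is a stand-in for the real data; mpg plays the role of the response
dat <- mtcars

# Linear model, evaluated on the data it was fitted to
fit_lm   <- lm(mpg ~ ., data = dat)
pred_lm  <- predict(fit_lm)
lm_stats <- c(rmse = rmse(dat$mpg, pred_lm), r2 = r2(dat$mpg, pred_lm))
print(lm_stats)

# Random forest on the same data, if the package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  fit_rf <- randomForest::randomForest(mpg ~ ., data = dat, ntree = 500)

  # Predictions for the training rows, passed explicitly as newdata
  pred_train <- predict(fit_rf, newdata = dat)
  # Out-of-bag predictions stored in the fitted object (a different quantity)
  pred_oob <- fit_rf$predicted

  print(c(rmse = rmse(dat$mpg, pred_train), r2 = r2(dat$mpg, pred_train)))
  print(c(rmse = rmse(dat$mpg, pred_oob),   r2 = r2(dat$mpg, pred_oob)))
}
```

The two sets of random forest numbers can differ noticeably, so it matters
which of them is being set against the lm() fit.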
The papers note that overfitting is not a problem with random forests,
and so I was wondering what I could do to improve the results. I've
tried playing with the number of trees and the value of m_try, but I don't
see much change.
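
The kind of tuning described above can be sketched as follows. This is an
illustration only: mtcars again stands in for the real data, the particular
mtry values tried are arbitrary, and the out-of-bag error after the last tree
is used as the yardstick (an assumption about what "change" means here).

```r
dat <- mtcars          # stand-in data; mpg is the response
p   <- ncol(dat) - 1   # number of predictors

if (requireNamespace("randomForest", quietly = TRUE)) {
  # A few candidate values for mtry (arbitrary choices for illustration)
  mtry_values <- unique(pmax(1, c(floor(p / 3), floor(p / 2), p)))

  oob_rmse <- sapply(mtry_values, function(m) {
    set.seed(1)
    fit <- randomForest::randomForest(mpg ~ ., data = dat,
                                      ntree = 1000, mtry = m)
    # fit$mse holds the out-of-bag MSE after each tree; take the last entry
    sqrt(fit$mse[fit$ntree])
  })

  print(data.frame(mtry = mtry_values, oob_rmse = oob_rmse))
}
```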
Is there anything that I can do to improve the results for a random
forest model? (Are there any significant papers, apart from Breiman, that
I should be reading related to random forests?)
2) My second question is related to interpretation of the variable
importance plot using var.imp.plot(). I realise that the variables are
ordered by decreasing importance. However, I see, for example,
that there is a large decrease in the value of Importance from the first
variable (i.e. the most important) to the second one, whereas for other
pairs the difference in the Importance value is not so large.
Is the difference between the Importance values a measure of 'how much
more important' a variable is? Or am I going in the wrong direction?
In addition, is there any sort of rule or heuristic that can be used to
say, for example, that the first N variables account for the model? Or is
the interpretation of variable importance descriptive in nature?
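
To make the question about importance values concrete, here is a small sketch
of the quantities being asked about: the drop from each variable to the next,
and the cumulative share of the total. describe_importance() is a made-up
helper, mtcars is a stand-in for the real data, and treating the scores as
shares of a total is purely a descriptive convenience assumed here, not a
claim that importance is additive.

```r
# Hypothetical helper: summarise a named vector of importance scores
describe_importance <- function(imp) {
  imp <- sort(imp, decreasing = TRUE)
  data.frame(
    variable     = names(imp),
    importance   = unname(imp),
    # Drop in importance from each variable to the next one in the ranking
    drop_to_next = c(-diff(unname(imp)), NA),
    # Running fraction of the summed importance
    cum_share    = cumsum(unname(imp)) / sum(imp),
    row.names    = NULL
  )
}

# Toy scores, just to show the layout of the result
print(describe_importance(c(a = 1, b = 6, c = 3)))

# The same summary for a fitted forest, if randomForest is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  fit <- randomForest::randomForest(mpg ~ ., data = mtcars, ntree = 500)
  imp <- randomForest::importance(fit)[, "IncNodePurity"]
  print(describe_importance(imp))
}
```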
Thanks,
-------------------------------------------------------------------
Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Science kind of takes the fun out of the portent business.
-Hobbes