[R] tuning random forest. An unexpected result

Wed Nov 23 17:00:42 CET 2011

Gianni,

You should not "tune" ntree in cross-validation or other validation methods, and especially should not be using OOB MSE to do so.

1. At ntree=1, you are using only about 37% of the data (the out-of-bag fraction, roughly 1/e) to assess the performance of a single random tree.  This number can vary wildly.  I'd say don't bother looking at the OOB measure of anything with ntree < 30.  If you want an exercise in probability, compute the number of trees you need to have the desired probability that all n data points are out-of-bag at least k times, and don't look at ntree below that number.

2. If you just plot the randomForest object using the generic plot() function, you will see that it gives you the vector of MSEs for ntree=1 to the max.  That's why you need not use other methods such as cross-validation.

3. As mentioned in the article you cited, RF is insensitive to ntree, and they settled on ntree=250.  Also, as we mentioned in the R News article, "too many trees" does not degrade prediction performance; it only adds computational cost (which is trivial even for moderately sized data sets).

4. It is not wise to "optimize" parameters of a model like that.  When all of the MSE estimates are within a few percent of each other, you're likely just chasing noise in the evaluation process.
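
For what it's worth, points 1, 2 and 4 can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the helper names (prob_all_oob_at_least, min_ntree, oob_curve), the choice k = 10 and the 0.95 target are made up for the example, and the probability uses a conservative Bonferroni lower bound rather than an exact calculation. The second part runs only if the randomForest package is installed.

```r
# Part 1 (point 1): how often is each observation out-of-bag (OOB)?
# Assumes each tree uses a bootstrap sample of size n drawn with replacement.
n <- nrow(na.omit(airquality))   # rows with no missing values
p_oob <- (1 - 1 / n)^n           # chance a given row is OOB for one tree, close to exp(-1)

# For one row, the number of trees in which it is OOB is Binomial(ntree, p_oob).
# Bonferroni gives a lower bound on P(every row is OOB at least k times)
# without assuming that rows are independent of each other.
prob_all_oob_at_least <- function(ntree, k, n, p = (1 - 1 / n)^n) {
  p_fail_one <- pbinom(k - 1, size = ntree, prob = p)  # P(one row is OOB fewer than k times)
  max(0, 1 - n * p_fail_one)
}

# Smallest ntree whose lower bound reaches the target (illustrative helper).
min_ntree <- function(k, n, target = 0.95, max_ntree = 5000) {
  for (ntree in k:max_ntree) {
    if (prob_all_oob_at_least(ntree, k, n) >= target) return(ntree)
  }
  NA_integer_
}

print(min_ntree(k = 10, n = n))

# Part 2 (points 2 and 4): OOB MSE as a function of ntree, for several seeds.
# The $mse component of a regression forest holds the OOB MSE after 1, 2, ..., ntree trees.
if (requireNamespace("randomForest", quietly = TRUE)) {
  aq <- na.omit(airquality)
  oob_curve <- function(seed, ntree = 500) {
    set.seed(seed)
    randomForest::randomForest(Ozone ~ ., data = aq, ntree = ntree)$mse
  }
  curves <- sapply(1:5, oob_curve)   # one column per seed, one row per ntree

  # Seed-to-seed spread at a few forest sizes
  print(round(curves[c(1, 35, 100, 250, 500), ], 1))
  matplot(curves, type = "l", lty = 1, log = "x",
          xlab = "ntree", ylab = "OOB MSE")
}
```

If the curves for different seeds differ from each other by about as much as the candidate settings differ, then picking the single lowest OOB MSE is mostly picking a lucky seed.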

Just my $0.02...

Best,
Andy

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of gianni lavaredo
> Sent: Thursday, November 17, 2011 6:29 PM
> To: r-help at r-project.org
> Subject: [R] tuning random forest. An unexpected result
> 
> Dear Researchers,
> 
> I am using RF (in regression mode) to analyze several metrics extracted from
> images. I am tuning RF with a loop over different ranges of mtry, tree
> and nodesize, choosing the combination with the lowest OOB MSE:
> 
> mtry from 1 to 5
> nodesize from 1 to 10
> tree from 1 to 500
> 
> using this paper as a reference:
> 
> Palmer, D. S., O'Boyle, N. M., Glen, R. C., & Mitchell, J. B. O. (2007).
> Random Forest Models To Predict Aqueous Solubility. Journal of Chemical
> Information and Modeling, 47, 150-158.
> 
> My problem is the following, using data(airquality):
> 
> The tuning parameters with the lowest value are:
> 
> > print(result.mtry.df[result.mtry.df$RMSE == min(result.mtry.df$RMSE),])
> RMSE = 15.44751
> MSE = 238.6257
> mtry = 3
> nodesize = 5
> tree = 35
> 
> The number of trees is very low, unlike what I read in several
> publications.
> 
> And the second-lowest value is a parameter set with tree = 1:
> 
> print(head(result.mtry.df[with(result.mtry.df, order(MSE)), ]))
>           RMSE      MSE mtry nodesize tree
> 12035 15.44751 238.6257    3        5   35
> 18001 15.44861 238.6595    4        7    1
> 7018  16.02354 256.7539    2        5   18
> 20031 16.02536 256.8121    5        1   31
> 11037 16.02862 256.9165    3        3   37
> 11612 16.05162 257.6544    3        4  112
> 
> I am wondering whether I made a mistake in the setup or whether there are
> some aspects I did not consider.
> Thanks for your attention, and thanks in advance for suggestions and help.
> 
> Gianni
> 