[R] random forest - optimising mtry

Wed Oct 13 12:08:24 CEST 2004

Dear R-helpers,

I am analysing mass spectra with randomForest in R. Following the 
recommendations for data with noisy variables, I don't want to use the 
default mtry (the square root of the number of variables), but I'm not 
sure up to which ratio mtry/nvariables it makes sense to increase mtry 
without "overtuning" RF.
Let me describe my example: I have 106 spectra belonging to 4 classes, 
and the number of variables is 172. I'm interested in information about 
the variables (importance, split points etc.) and in the proximities.
First I ran a forest with mtry = 30 and ntree = 2500. The result was an 
OOB estimate of the overall error rate of zero, i.e. perfect 
classification. To explore my results, I calculated the average 
proximity between the classes. I got:
 > res
           op12      op13       op14       op23      op24       op34
[1,] 0.06145473 0.1369406 0.08036264 0.06171053 0.1113126 0.06732087
For me, the important point of these values is that classes 1 and 3, as 
well as classes 2 and 4, share more common features than the other 
pairs of classes. I have already worked a lot with these data and have 
looked closely at my spectra, and I believe these proximities to be 
realistic.
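
For reference, here is a minimal sketch of one way such between-class 
averages could be computed. It is not necessarily how the values above 
were obtained. The helper name mean_class_proximity and the toy matrix 
are made up for illustration; with a real forest, prox would be 
rf$proximity from a randomForest fit grown with proximity = TRUE.

```r
# Hypothetical helper (not part of randomForest): mean proximity between
# every pair of classes.
#   prox: n x n proximity matrix, e.g. rf$proximity
#   cls : class labels of the n cases, in the same order as prox
mean_class_proximity <- function(prox, cls) {
  cls <- as.factor(cls)
  lev <- levels(cls)
  pairs <- combn(length(lev), 2)   # columns are class pairs (1,2), (1,3), ...
  res <- apply(pairs, 2, function(p) {
    # average over all cases of class p[1] versus all cases of class p[2]
    mean(prox[cls == lev[p[1]], cls == lev[p[2]], drop = FALSE])
  })
  names(res) <- paste0("op", pairs[1, ], pairs[2, ])
  res
}

# Toy data, made up for illustration: 4 cases in 2 classes.
prox_toy <- matrix(c(1.0, 0.8, 0.2, 0.1,
                     0.8, 1.0, 0.3, 0.2,
                     0.2, 0.3, 1.0, 0.6,
                     0.1, 0.2, 0.6, 1.0), nrow = 4, byrow = TRUE)
cls_toy <- factor(c("a", "a", "b", "b"))

res_toy <- mean_class_proximity(prox_toy, cls_toy)
res_toy   # op12 is the mean of 0.2, 0.1, 0.3 and 0.2, i.e. 0.2
```

With 4 classes the result has the six entries op12, op13, op14, op23, 
op24 and op34, in that order.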

Then I ran the tuneRF function (stepFactor = 1.5), which returned 
mtry = 63. A new forest with this mtry and 2500 trees gave me perfect 
classification as well, but the relation between the proximity values 
changed a lot:
 > res
          op12     op13       op14       op23       op24       op34
[1,] 0.1092702 0.117489 0.09696328 0.08725208 0.08495621 0.06506148
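
For readers who want to reproduce this kind of tuning step, the sketch 
below shows a tuneRF call followed by a refit with proximities. It uses 
the built-in iris data purely as a stand-in, not the spectra discussed 
here. The starting value mtryStart = 2, ntreeTry = 500 and the seed are 
arbitrary assumptions, so the numbers it produces say nothing about the 
results above.

```r
library(randomForest)

# Stand-in data (iris), NOT the mass spectra from this post.
x <- iris[, 1:4]
y <- iris$Species

# Search over mtry, multiplying or dividing it by stepFactor at each step.
# tuneRF returns a matrix with one row per mtry value tried.
set.seed(1)
tuned <- tuneRF(x, y, mtryStart = 2, ntreeTry = 500, stepFactor = 1.5,
                improve = 0.05, trace = FALSE, plot = FALSE)
tuned   # columns: mtry, OOBError

# Pick the mtry with the lowest OOB error and refit with proximities.
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
rf <- randomForest(x, y, mtry = best_mtry, ntree = 500, proximity = TRUE)

rf$err.rate[rf$ntree, "OOB"]   # OOB error of the final forest
dim(rf$proximity)              # one row and one column per case
```

Because the OOB error varies from run to run, repeating the search with 
several seeds gives a better idea of whether two mtry values really 
differ.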

This is what makes me think that I have overtuned my second forest... 
So how should I choose mtry?

Best regards,
Ute