[R] random forest - optimising mtry
Ute
falaise at web.de
Wed Oct 13 12:08:24 CEST 2004
Dear R-helpers,
I'm working on mass spectra in randomForest/R, and following the
recommendations for the case of noisy variables, I don't want to use the
default mtry (sqrt of nvariables), but I'm not sure up to which
proportion mtry/nvariables it makes sense to increase mtry without
"overtuning" RF.
Let me describe my example: I have 106 spectra belonging to 4 classes, the
number of variables is 172. I'm interested in finding information about
variables (importance, split points etc.) and proximities.
First I ran a forest with mtry=30 and ntree=2500. The result was an
oob-estimate of overall error rate of zero, perfect classification. In
order to explore my results, I calculated the average proximity between
the classes. I got:
> res
           op12      op13       op14       op23      op24       op34
[1,] 0.06145473 0.1369406 0.08036264 0.06171053 0.1113126 0.06732087
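For readers who want to reproduce this kind of summary: one way to compute average between-class proximities is sketched below. This is only a guess at the calculation, not necessarily the code used here. The function name class_pair_proximity and the toy matrix are made up for illustration; with a real forest, prox would be the proximity component of a randomForest object fitted with proximity=TRUE.

```r
# Mean proximity between every pair of classes.
# prox: n x n proximity matrix (e.g. rf$proximity)
# cls:  class label of each of the n samples
class_pair_proximity <- function(prox, cls) {
  cls <- as.factor(cls)
  lev <- levels(cls)
  pairs <- combn(length(lev), 2)   # all unordered class pairs, one per column
  res <- apply(pairs, 2, function(p) {
    # average over all (sample in class a, sample in class b) cells
    mean(prox[cls == lev[p[1]], cls == lev[p[2]], drop = FALSE])
  })
  names(res) <- paste0("op", pairs[1, ], pairs[2, ])
  res
}

# Toy check with a hand-made 4 x 4 proximity matrix (NOT the data above)
prox <- matrix(c(1.0, 0.8, 0.1, 0.3,
                 0.8, 1.0, 0.2, 0.4,
                 0.1, 0.2, 1.0, 0.6,
                 0.3, 0.4, 0.6, 1.0), nrow = 4, byrow = TRUE)
cls <- factor(c(1, 1, 2, 2))
class_pair_proximity(prox, cls)   # op12 = 0.25
```

With four classes the result has the six entries op12, op13, op14, op23, op24, op34, in that order.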
For me, the important point in these values is that classes 1 and 3, as
well as classes 2 and 4, share more common features than the other pairs
of classes. I have already worked a lot with these data and looked
closely at my spectra, and I believe these proximities to be
realistic.
Then I ran the tuneRF function (stepFactor=1.5) and got mtry=63.
A new forest with this mtry and 2500 trees gave me perfect
classification as well, but the relationships among the proximity values
changed a lot:
> res
          op12     op13       op14       op23       op24       op34
[1,] 0.1092702 0.117489 0.09696328 0.08725208 0.08495621 0.06506148
This is what makes me think that I have overtuned my second forest. So
how should I choose mtry?
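
The two-step workflow described above (tuneRF with stepFactor=1.5, then a full forest with the suggested mtry) can be sketched roughly as follows. This is an illustration under assumptions, not the exact calls used: the wrapper name tune_then_fit, the values of ntreeTry and improve, and the seed are all made up, and iris merely stands in for the spectra.

```r
# Sketch only: x and y are placeholders for the spectra matrix and the
# class factor; ntreeTry, improve and seed are assumed values.
tune_then_fit <- function(x, y, mtry_start = 30, ntree = 2500, seed = 1) {
  set.seed(seed)
  # tuneRF returns a matrix: column 1 = mtry tried, column 2 = OOB error
  tuned <- randomForest::tuneRF(x, y, mtryStart = mtry_start, ntreeTry = 500,
                                stepFactor = 1.5, improve = 0.01,
                                trace = FALSE, plot = FALSE)
  best_mtry <- tuned[which.min(tuned[, 2]), 1]
  # Refit a large forest with the selected mtry, keeping proximities
  rf <- randomForest::randomForest(x, y, mtry = best_mtry, ntree = ntree,
                                   proximity = TRUE, importance = TRUE)
  list(tuned = tuned, mtry = best_mtry, forest = rf)
}

# Stand-in example on iris (runs only if randomForest is installed)
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- tune_then_fit(iris[, 1:4], iris$Species, mtry_start = 2, ntree = 200)
  print(fit$tuned)                          # mtry values tried and OOB error
  print(fit$forest$err.rate[200, "OOB"])    # OOB error of the final forest
}
```

Rerunning with a different seed gives a rough idea of how much the OOB error and the proximities vary between forests grown with the same mtry.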
Best regards,
Ute