[R] simplifying randomForest(s)
Ramon Diaz-Uriarte
rdiaz at cnio.es
Tue Sep 16 16:31:19 CEST 2003
Dear Andy,
Thanks a lot for your message.
> This is quite a hazardous game. We've been burned by this ourselves. I'll
> send you a paper we submitted on variable selection for random forest
> off-line. (Those who are interested, let me know.)
Thanks!
>
> The basic problem is that when you select important variables by RF and
> then re-run RF with those variables, the OOB error rate becomes biased
> downward. As you iterate more times, the "overfitting" becomes more and
> more severe (in the sense that the OOB error rate will keep decreasing
> while the error rate on an independent test set stays flat or increases). I
> was naïve enough to ask Breiman about this, and his reply was something
> like "any competent statistician would know that you need something like
> cross-validation to do that"...
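(To see the bias concretely: with pure-noise data the true error rate of any
classifier is 50%, yet iterating selection on RF importance drives the OOB
estimate steadily down. A quick sketch, assuming only the randomForest
package; "simplification" here just keeps the top half of the variables by
importance each round:)

library(randomForest)

## pure noise: no variable is related to the class, so the honest
## error rate of any classifier is 50%
set.seed(1)
n <- 50; p <- 200
x <- data.frame(matrix(rnorm(n * p), n, p))
y <- factor(rep(0:1, length.out = n))

vars <- names(x)
for (iter in 1:5) {
    rf <- randomForest(x[, vars], y, importance = TRUE)
    cat("round", iter, ":", length(vars), "variables, OOB error",
        round(rf$err.rate[rf$ntree, "OOB"], 3), "\n")
    ## keep the top half of the variables by mean decrease in accuracy
    imp <- importance(rf, type = 1)
    vars <- vars[order(imp, decreasing = TRUE)[1:ceiling(length(vars) / 2)]]
}

On data like these the OOB error typically drifts well below 0.5 within a few
rounds, while the true error stays at 50%.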
Yes, I understand the points you are making. However, I have tried to protect
against this problem by assessing the leave-one-out cross-validation error
(LOOCVE) of the complete selection process, and the LOOCVE suggests this is
working. Within the variable selection routine the OOB error rate is biased
downward, but that does not concern me much, because I only use it to guide
the selection; my final estimate of error comes from the LOOCVE.
This is the skeleton of the algorithm:
n <- length(y)
prediction <- rep(NA, n)  ## will hold the leave-one-out predictions
for (i in 1:n) {
    ## repeat the *whole* simplification process without observation i
    the.simple.rf <- simplify.the.rf(data = data[-i, ])
    ## the simplified forest then predicts only the held-out observation
    prediction[i] <- as.character(predict(the.simple.rf,
                                          newdata = data[i, , drop = FALSE]))
}
loocve <- sum(y != prediction) / n  ## error on never-before-seen observations
Thus, the LOOCVE is computed with observations that were never used for the
simplification of the forest that predicts them.
[I'll be glad to send my code to anyone interested].
And, the interesting thing with the data set I have tried is that it seems to
perform reasonably well (in fact, the LOOCVE of the forest with the reduced
set of variables is smaller than the LOOCVE of the forest with all the
variables).
(This is a first shot. I have a small sample size (29), so LOOCV is not that
bad computationally, although I am aware it can have high variance. I guess I
could try the .632+ bootstrap method.)
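(For reference, the .632+ estimator of Efron & Tibshirani (1997) combines the
resubstitution error with the leave-one-out bootstrap error. A minimal sketch,
taking those two quantities and the no-information error rate as given, and
omitting the clipping details of the published definition:)

## err.train: resubstitution error on the full data
## err.boot1: leave-one-out bootstrap error (average error on the
##            observations left out of each bootstrap sample)
## gamma:     no-information error rate
err632plus <- function(err.train, err.boot1, gamma) {
    R <- (err.boot1 - err.train) / (gamma - err.train)  ## relative overfitting
    R <- max(0, min(1, R))
    w <- 0.632 / (1 - 0.368 * R)
    (1 - w) * err.train + w * err.boot1
}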
Best,
Ramón
--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +34-91-224-6972
Phone: +34-91-224-6900
http://bioinfo.cnio.es/~rdiaz