[BioC] RandomForest, supervised machine learning and uncertainty

Vincent Carey stvjc at channing.harvard.edu
Wed Dec 8 13:07:59 CET 2010

On Wed, Dec 8, 2010 at 5:43 AM, January Weiner
<january.weiner at mpiib-berlin.mpg.de> wrote:
> Dear all,
> I am using RandomForests for supervised machine learning. My set of
> biomarkers is quite good at distinguishing the samples from different
> classes.
> However, I would get an even better classification if I could
> introduce a class of "Unknown" or "Unclassified" samples. Given that
> alrf is the RF object
> alrf <- randomForest( group ~ ., data=all )
> I take a look at the matrix alrf$votes. I notice that in almost all
> the misclassified cases, the votes were close to a tie; there were
> also some correctly classified cases close to a tie.
> If I define an additional group called "Undefined", this group will be
> larger than the percentage of misclassified cases (as some correctly
> annotated cases will go into that class). However, the error rate
> *outside* of the class will be almost negligible. From a purely
> pragmatic point of view in biomarker discovery such a situation is
> preferable: it's better to admit that you don't know something than to
> risk a misclassification.
> And here is my question:
> Is there a standard method of creating such a class?  For example, for
> a given sample i, I use sum( ( votes[i,] - max( votes[i,] ) )^2 ) or
> the difference between the two top votes for a given sample. But I
> think that this approach is not sufficient.

I don't think there is anything like a "standard method" for this
task, but if I read you correctly you are addressing the extension of
the decision task from two classes to two classes plus "doubt".  This
is discussed at some length in Ripley's "Pattern Recognition and
Neural Networks" book; see the comments on the "error-reject" curve on
p. 20 and on the "safety threshold" concept on p. 22.
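
To make the trade-off concrete, here is a minimal base-R sketch of one way
to trace an error-versus-reject curve from a vote matrix shaped like
alrf$votes. Everything in it is an assumption for illustration: the vote
fractions and labels are made up, top_two_margin() and error_reject() are
hypothetical helpers, and the error rate is computed among accepted
samples only, which may not match the normalisation used in the book.

```r
# Toy vote fractions for a two-class forest (rows = samples); made-up numbers.
votes <- matrix(c(0.95, 0.05,
                  0.55, 0.45,
                  0.48, 0.52,
                  0.10, 0.90,
                  0.60, 0.40,
                  0.30, 0.70),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("A", "B")))
truth <- factor(c("A", "B", "B", "B", "A", "B"), levels = c("A", "B"))

# Hypothetical helper: difference between the two largest vote fractions per row.
top_two_margin <- function(v) {
  apply(v, 1, function(r) {
    s <- sort(r, decreasing = TRUE)
    unname(s[1] - s[2])
  })
}

# Hypothetical helper: for each margin threshold, reject low-margin samples and
# report the reject rate and the error rate among the samples that are kept.
error_reject <- function(v, truth, thresholds) {
  pred <- factor(colnames(v)[max.col(v, ties.method = "first")],
                 levels = levels(truth))
  m <- top_two_margin(v)
  t(sapply(thresholds, function(th) {
    keep <- m >= th
    c(threshold   = th,
      reject_rate = mean(!keep),
      error_rate  = if (any(keep)) mean(pred[keep] != truth[keep]) else NA_real_)
  }))
}

curve <- error_reject(votes, truth, c(0, 0.15, 0.25))
print(curve)
```

In this toy data the single misclassified sample also has a small margin,
so raising the threshold removes the error at the cost of rejecting some
correctly classified samples. With a real forest the out-of-bag votes
would take the place of the toy matrix.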

The MLInterfaces vignette has an application at its very end that, as
written, turns out to be nugatory: the doubt interval is too narrow to
capture any classification for the data in use.  If you change the code
to

douPred[smallDou(0.35, 0.65)] <- "doubt"

one prediction is converted to "doubt".  This issue deserves more attention.
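
The effect of the interval width can be sketched in base R without the
vignette's data. This is not the vignette's code: in_doubt() is a
hypothetical stand-in for whatever smallDou does there, the narrow
interval (0.49, 0.51) is my own choice, and the probabilities are made
up.

```r
# Hypothetical helper: TRUE where a class probability lies strictly inside (lo, hi).
in_doubt <- function(p, lo, hi) p > lo & p < hi

# Made-up posterior probabilities of class "pos" for five samples.
p_pos <- c(0.92, 0.58, 0.40, 0.08, 0.70)
pred  <- ifelse(p_pos > 0.5, "pos", "neg")

# A very narrow doubt interval changes nothing for these values.
narrow <- pred
narrow[in_doubt(p_pos, 0.49, 0.51)] <- "doubt"

# A wider interval converts the two near-tie predictions to "doubt".
wide <- pred
wide[in_doubt(p_pos, 0.35, 0.65)] <- "doubt"

print(table(narrow))
print(table(wide))
```

How wide to make the interval is a judgment about the cost of a wrong
call relative to the cost of no call.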

> Best regards,
> j.
> --
> -------- Dr. January Weiner 3 --------------------------------------
> Max Planck Institute for Infection Biology
> Charitéplatz 1
> D-10117 Berlin, Germany
> Web   : www.mpiib-berlin.mpg.de
> Tel     : +49-30-28460514
