[BioC] RandomForest, supervised machine learning and uncertainty

Wed Dec 8 11:43:43 CET 2010

Dear all,

I am using random forests (the randomForest package) for supervised
classification. My set of biomarkers is quite good at distinguishing
the samples from the different classes.

However, I would get an even better classification if I could
introduce a class of "Unknown" or "Unclassified" samples. Let alrf be
the fitted RF object:

alrf <- randomForest( group ~ ., data=all )

I take a look at the matrix alrf$votes. I notice that in almost all
the misclassified cases, the votes were close to a tie; there were
also some correctly classified cases close to a tie.

If I define an additional group called "Undefined", the fraction of
samples in this group will be larger than the fraction of
misclassified cases (as some correctly classified cases will also end
up in it). However, the error rate *outside* of this group will be
almost negligible. From a purely pragmatic point of view, such a
situation is preferable in biomarker discovery: it's better to admit
that you don't know something than to risk a misclassification.
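
To make the trade-off concrete, here is a minimal sketch on a made-up
vote matrix. The numbers, the labels, the helper name
classify_with_reject and the cutoff of 0.2 are all invented for
illustration and are not my data; as the tie measure it uses the gap
between the two largest vote fractions, which is only one possible
choice.

```r
# Toy vote fractions (rows = samples, columns = classes); invented numbers,
# standing in for what alrf$votes would contain.
votes <- matrix(c(0.90, 0.10,
                  0.55, 0.45,
                  0.48, 0.52,
                  0.20, 0.80,
                  0.58, 0.42,
                  0.05, 0.95),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("A", "B")))
truth <- c("A", "A", "A", "B", "B", "B")

# Hypothetical helper: call a sample "Undefined" when the gap between the
# two largest vote fractions is below `cutoff`.
classify_with_reject <- function(votes, cutoff) {
  top2 <- t(apply(votes, 1, function(v) sort(v, decreasing = TRUE)[1:2]))
  margin <- top2[, 1] - top2[, 2]
  pred <- colnames(votes)[max.col(votes, ties.method = "first")]
  pred[margin < cutoff] <- "Undefined"
  pred
}

# Plain majority-vote prediction, without the extra class
plain <- colnames(votes)[max.col(votes, ties.method = "first")]

pred   <- classify_with_reject(votes, cutoff = 0.2)
called <- pred != "Undefined"

error_all      <- mean(plain != truth)                 # error without "Undefined"
frac_undefined <- mean(!called)                        # size of the "Undefined" group
error_called   <- mean(pred[called] != truth[called])  # error outside that group
```

In this toy example two of six samples are misclassified by the plain
majority vote, three of six end up as "Undefined" (including one that
was classified correctly), and none of the remaining samples is wrong.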

And here is my question:

Is there a standard method for creating such a class? So far, for a
given sample i, I use either sum( ( votes[i,] - max( votes[i,] ) )^2 )
or the difference between the two top votes as a measure of how close
the vote is to a tie. But I suspect that this approach is not
sufficient.
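
For reference, the two scores mentioned above can be computed for all
samples at once roughly as follows. This is only a sketch: the vote
matrix is made up (in practice it would be alrf$votes), and the names
sq_dist and top_gap are just labels I chose here.

```r
# Toy vote fractions for three classes; invented numbers for illustration.
votes <- matrix(c(0.90, 0.05, 0.05,
                  0.40, 0.35, 0.25,
                  0.34, 0.33, 0.33),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("A", "B", "C")))

# Score 1: sum of squared differences from the top vote, per sample.
# It is 0 for a perfect tie across all classes and grows as one class dominates.
sq_dist <- apply(votes, 1, function(v) sum((v - max(v))^2))

# Score 2: difference between the two largest vote fractions, per sample.
top_gap <- apply(votes, 1, function(v) {
  s <- sort(v, decreasing = TRUE)
  unname(s[1] - s[2])
})

# Samples ordered from closest to a tie to most clear-cut
order(top_gap)
```

With only two classes the first score is simply the square of the
second, so both carry the same information; they can only rank
samples differently when there are three or more classes.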

Best regards,

j.

-- 
-------- Dr. January Weiner 3 --------------------------------------
Max Planck Institute for Infection Biology
Charitéplatz 1
D-10117 Berlin, Germany
Web   : www.mpiib-berlin.mpg.de
Tel     : +49-30-28460514