[R] imbalanced classes

Thu Jan 26 00:56:24 CET 2006

Hi Andy,

I know this topic has been discussed before on the R-help, but I was
wondering if you could offer some advice specific to my application.

I'm using the R randomForest package to compare two classes of data,
and the number of cases in each class is relatively low: 28 in class 1
and 9 in class 2. I'd really like to use the R environment to analyze
these data, but I'm finding it difficult to put much trust in the
results of my analysis.  As you've stated, the classwt argument does
not do much, and I've tried working with the cutoff and sampsize
arguments as well, with limited success in balancing error rates
between the two classes.

It is unclear to me how to use the cutoff argument correctly.  If
you have any recommendations here, I would appreciate them.
Additionally, with the sampsize argument, I have tried a few values,
for example sampsize = c(2, 6) and c(9, 3).  It wasn't clear to me
whether I should be sampling more from the larger class or the other
way around.
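
To make the question concrete, here is a minimal sketch of how these
two arguments are documented to behave.  The data are simulated
placeholders with the same class sizes (28 and 9), not the real data
set.  pick_class() is a hypothetical helper written only to illustrate
the cutoff rule from the randomForest help page, and the sampsize and
cutoff values are illustrative assumptions, not recommendations.

```r
# Simulated stand-in data: 28 cases in class 1, 9 in class 2 (assumption).
set.seed(1)
y <- factor(rep(c("1", "2"), times = c(28, 9)))
x <- data.frame(a = rnorm(37, mean = as.integer(y)), b = rnorm(37))

# cutoff rule as described in the randomForest help page: the winning
# class is the one with the largest ratio of vote proportion to cutoff.
# pick_class() is a hypothetical helper, not part of the package.
pick_class <- function(votes, cutoff) {
  names(votes)[which.max(votes / cutoff)]
}

votes    <- c("1" = 0.65, "2" = 0.35)                # made-up vote fractions
majority <- pick_class(votes, cutoff = c(0.5, 0.5))  # plain majority vote
shifted  <- pick_class(votes, cutoff = c(0.7, 0.3))  # easier for class 2 to win

# Stratified sampling: one sampsize entry per stratum, in the order of
# levels(y).  c(9, 9) draws equally from both classes for every tree.
# The fit only runs if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(
    x, y,
    strata   = y,
    sampsize = c(9, 9),
    cutoff   = c(0.5, 0.5),
    ntree    = 500
  )
  print(fit$confusion)   # per-class error rates appear in the last column
}
```

With a cutoff of c(0.5, 0.5) the example votes go to class 1, and with
c(0.7, 0.3) the same votes go to class 2, so the cutoff shifts errors
from one class to the other without refitting the trees.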

Lastly, I'm wondering whether you are currently working on, or plan to
release in the near future, an R version of randomForest that is
equivalent to the FORTRAN rf5 package.  It works wonderfully for my
application, but getting data in and out of it, changing parameters,
and compiling is just a pain, as I'm sure you agree.

Your thoughts would be greatly appreciated.

Kind regards,

Mark D'Ascenzo
Biomedical Engineering
Cornell University
Ithaca, NY 14853