[R] imbalanced classes

Thu Jan 26 03:00:51 CET 2006

Mark,

I guess the message is meant for me (yet you sent it to R-help).

If you have 10 cases of class A and 100 of class B, not setting sampsize
means each tree is grown on a random sample (with replacement) of 110 from
the whole lot, which, of course, gives you on average 10 times as many Bs
as As in the sample.  A tree grown on such a sample is not going to do well
at predicting the As.  However, if you set sampsize=c(10, 10), then each
tree is grown on 10 randomly sampled As and 10 randomly sampled Bs, giving
the tree a much better chance of producing roughly similar error rates for
As and Bs.  If equal sampsize values don't quite do it, you can push further
in the same direction, i.e., draw fewer cases from the larger class than
from the smaller one.
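
To make the sampling argument concrete, here is a minimal base-R sketch that
mimics the two sampling schemes on simulated labels.  It does not call
randomForest itself; the labels, the seed, and the variable names are all
assumptions made up for illustration.

```r
# Sketch in base R: class make-up of one tree's training sample
# under the two sampling schemes.  The labels are simulated.
set.seed(1)                                   # arbitrary seed, for reproducibility
y <- factor(rep(c("A", "B"), times = c(10, 100)))

# Default scheme: bootstrap 110 cases with replacement from the whole data set.
plain <- replicate(2000, table(sample(y, length(y), replace = TRUE)))
mean_plain <- rowMeans(plain)                 # average count of A and B per sample

# Stratified scheme: 10 cases from each class, as with sampsize = c(10, 10).
idx_A <- which(y == "A")
idx_B <- which(y == "B")
strat <- y[c(sample(idx_A, 10, replace = TRUE),
             sample(idx_B, 10, replace = TRUE))]
count_strat <- table(strat)

print(mean_plain)    # averages close to 10 As and 100 Bs
print(count_strat)   # exactly 10 As and 10 Bs
```

The first printout shows how heavily the default sample is dominated by Bs;
the second shows the balanced sample each tree would see instead.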

As to cutoff, in a two-class problem it amounts to setting the
classification threshold to something other than 0.5.  The predicted class
is the one with the largest ratio of vote fraction to cutoff.  E.g., if
cutoff=c(0.9, 0.1), then a case with 80% of the votes for class A would
still be classified as B, because .8/.9 < .2/.1; A wins only when it gets
more than 90% of the votes.  Hope that's clear.
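
The cutoff rule can be sketched in a few lines of base R.  The helper name
classify_by_cutoff is hypothetical (it is not part of randomForest); it only
reproduces the ratio comparison from the example above.

```r
# Hypothetical helper: pick the class with the largest votes/cutoff ratio.
classify_by_cutoff <- function(votes, cutoff) {
  ratio <- votes / cutoff            # vote fraction divided by class cutoff
  names(votes)[which.max(ratio)]     # name of the class with the largest ratio
}

votes  <- c(A = 0.8, B = 0.2)        # 80% of the trees vote for A
cutoff <- c(A = 0.9, B = 0.1)

pred_skewed <- classify_by_cutoff(votes, cutoff)              # ratios 0.889 vs 2
pred_even   <- classify_by_cutoff(votes, c(A = 0.5, B = 0.5)) # plain majority vote

print(pred_skewed)   # "B"
print(pred_even)     # "A"
```

With equal cutoffs the rule reduces to an ordinary majority vote, which is why
the same case flips back to A in the second call.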

I do have to wonder, though: if you only have a total of 37 cases in the
data, how can you be sure the estimates of class error rates you get will
pan out on a larger test set?  I would think the variability of the
estimated class error rates is so high that it doesn't make much sense to
try to balance them too finely...  Just my $0.02.
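
A rough back-of-the-envelope calculation shows the scale of that variability.
The sketch below treats each class error rate as a simple binomial proportion,
which is only an approximation, and the true error rate of 0.3 is an assumed
value chosen purely for illustration.

```r
# Rough binomial approximation of how noisy the class error rates are.
n <- c(class1 = 28, class2 = 9)    # class sizes from the quoted message
p <- 0.3                           # assumed true error rate, illustrative only

se   <- sqrt(p * (1 - p) / n)      # standard error of an estimated error rate
step <- 1 / n                      # change caused by one more misclassified case

print(round(rbind(se, step), 3))
```

For the class with 9 cases, a single case switching sides moves the estimated
error rate by 1/9, which is the same order of magnitude as the differences one
would be trying to balance away.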

I do plan on implementing the weighted RF (see the To Do part of rfNews()),
but don't hold your breath...

Cheers,
Andy

From: Mark D'Ascenzo
> 
> Hi Andy,
> 
> I know this topic has been discussed before on R-help, but I was
> wondering if you could offer some advice specific to my application.
> 
> I'm using the R random forest package to compare two classes of data,
> with a relatively low number of cases in each class: 28 in class 1 and 9
> in class 2. I'd really like to use the R environment to analyze this data;
> however, I'm finding it difficult to put much trust in the results of
> my analysis.  As you've stated, the classwt variables do not do much,
> and I've tried working with the cutoff and sampsize variables as
> well, with limited success in balancing error rates between the two
> classes.
> 
> It was unclear to me how to use the cutoff parameter correctly.  If
> you have any recommendations here, they would be appreciated.
> Additionally with the sampsize variable, I have tried a few values,
> for example setting sampsize = c(2, 6) and c(9, 3), etc.  It wasn't
> clear to me if I should be sampling more from the larger class or the
> other way around.
> 
> Lastly, I'm wondering if you are currently working on, or have plans to
> release in the near future, an R version of randomForest that is
> equivalent to the FORTRAN rf5 package.  It works wonderfully for my
> application, but getting data in and out of it, changing parameters,
> and compiling is just a pain, as I'm sure you agree.
> 
> Your thoughts would be greatly appreciated.
> 
> Kind regards,
> 
> Mark D'Ascenzo
> Biomedical Engineering
> Cornell University
> Ithaca, NY 14853