[R] CART vs. Random Forest

Philippe Grosjean phgrosje at ulb.ac.be
Thu Sep 26 18:37:04 CEST 2002


I am currently working on zooplankton identification after image analysis. I
selected random forest and double-bagging (rpart + lda combined, R library
ipred) as the two classification methods. I also notice that the performance
of both methods changes with the "quality" of the picture and with the
relative proportions of the 8 groups (27 variables, training set of around
2000 individuals). Double-bagging performs better on any sample that is very
similar to the training set (group proportions, picture contrast), but its
performance degrades much faster than random forest's on lower-quality
pictures or different group proportions. By comparing the results of the two
methods, it is possible to tag 'suspect' individuals, that is, items assigned
to different categories by the two methods. If these suspect items are
eliminated, the correct identification rate remains surprisingly constant,
around 80%. I still do not fully understand why this happens; if someone has
an idea, I would be glad to hear it. Anyway, this could be something to try
on your data set as well.
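
In rough R terms, the tagging idea looks like this (train, newdata and group
are placeholders for my data; the 'comb' construction follows the
double-bagging example in ipred's documentation, as far as I understand it):

library(randomForest)
library(ipred)
library(MASS)    # lda, for the double-bagging combination

## placeholder data frames: the 27 image variables plus a factor 'group'
rf <- randomForest(group ~ ., data = train)

## double-bagging: bagged rpart trees combined with lda via 'comb'
comb.lda <- list(list(model = lda,
                      predict = function(obj, newdata) predict(obj, newdata)$x))
db <- bagging(group ~ ., data = train, comb = comb.lda)

pred.rf <- predict(rf, newdata)
pred.db <- predict(db, newdata)

## tag 'suspect' items: those the two methods classify differently
suspect <- as.character(pred.rf) != as.character(pred.db)

## correct identification rate on the remaining items, when truth is known
mean(as.character(pred.rf)[!suspect] == as.character(newdata$group)[!suspect])
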
Best,

Philippe Grosjean

-----Original Message-----
From: owner-r-help at stat.math.ethz.ch
[mailto:owner-r-help at stat.math.ethz.ch] On Behalf Of Wiener, Matthew
Sent: Thursday, September 26, 2002 2:51
To: 'Andrew Baek'; r-help at stat.math.ethz.ch
Subject: RE: [R] CART vs. Random Forest


If either method were just guessing 0 to reduce the error rate, shouldn't it
achieve a 1/34 ~ 3% or 1/100 = 1% error rate in the last two examples, and
for that matter 20% and 10% in the first two? It doesn't look like that's
what's going on.
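
Just to make the arithmetic explicit, the error rate of a rule that always
predicts 0 is simply the proportion of 1's in each mix:

## baseline error of always predicting "0" in a 1:k mix of 1's to 0's
k <- c(4, 9, 33, 99)
1 / (k + 1)    # 0.2000 0.1000 0.0294 0.0100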

One suggestion, if making sure you find the 1's is more important than having
a low overall error rate: in rpart, you can specify a loss matrix to say
that certain kinds of errors are more costly than others. In a random
forest, you can use different voting thresholds for "1-ness" and "0-ness" to
bias things -- that is, instead of just taking a majority vote, you might
require (for example) 85% of the trees to agree before something is declared
to be in class 0.
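
A rough sketch of both ideas (the data frame d and the two-level factor y
are placeholders; 'cutoff' is the argument provided by the randomForest
package):

library(rpart)
library(randomForest)

## placeholder data: d contains a factor response y with levels "0" and "1"

## rpart loss matrix: rows = true class, columns = predicted class.
## Here calling a true "1" a "0" costs 5 times as much as the reverse.
loss <- matrix(c(0, 5,     # predicted "0": true "0", true "1"
                 1, 0),    # predicted "1": true "0", true "1"
               nrow = 2)
fit.cart <- rpart(y ~ ., data = d, method = "class",
                  parms = list(loss = loss))

## randomForest voting cutoffs, one per class in factor-level order.
## With c(0.85, 0.15) an item is called "0" only if over 85% of trees agree.
fit.rf <- randomForest(y ~ ., data = d, cutoff = c(0.85, 0.15))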

It's hard to say much more without knowing anything about your data.  But in
my experience random forests have substantially outperformed single trees in
many problems (and I haven't yet encountered one in which a single tree
outperformed a random forest).

Hope this helps,

Matthew Wiener
RY84-202
Applied Computer Science & Mathematics Dept.
Merck Research Labs
126 E. Lincoln Ave.
Rahway, NJ 07065
732-594-5303

-----Original Message-----
From: Andrew Baek [mailto:andrew at stat.ucla.edu]
Sent: Wednesday, September 25, 2002 3:52 PM
To: r-help at stat.math.ethz.ch
Subject: [R] CART vs. Random Forest


According to Dr. Breiman, RF should be a more accurate
method than a single tree. However, in my case the performance of each
method seems to depend on the proportion of the outcome variable.
My data set is a typical classification problem (predicting bad guys).
When I ran both methods with different proportions of the outcome
variable (there is a criterion to measure the degree of bad behavior),
I got very strange results.

1. proportion of 1 to 0 = 1:4
err.rate of CART = 25.2%
err.rate of RF = 25.6%

2. 1:9
err.rate of CART = 28.5%
err.rate of RF = 21.2%

3. 1:33
err.rate of CART = 28.2%
err.rate of RF = 12.1%

4. 1:99
err.rate of CART = 25.1%
err.rate of RF = 7.3%


In 3 & 4, RF looks superior to CART. But I'm afraid RF just
votes for "0" to reduce the error rate (a sketch of how to check
that is below). Any suggestions?
Thank you.
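
One way to check whether the forest is just voting "0" is the out-of-bag
confusion matrix, which reports a separate error rate for each class, so a
forest that always votes "0" would show a class "1" error near 100% even if
the overall error rate looks small (minimal sketch; d and y stand in for the
real data):

library(randomForest)
fit <- randomForest(y ~ ., data = d)
fit$confusion    # per-class OOB error rates in the last column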
