[R] RandomForest vs. bayes & svm classification performance
Jameson C. Burt
jameson at coost.com
Thu Jul 27 21:22:35 CEST 2006
I'm remiss in not having tried these R tools.
However, I tried a dozen Naive Bayes-like programs, often used to filter
email, where the serious problem with spam has resulted in many
The most touted of the worldwide Naive Bayes programs seems to be
CRM114 (not in R, I expect, since its programming is peculiar),
whose 275 pages of documentation is at
However, unless you have several weeks and some flexible programming
skills, don't consider it.
It took me about 3 months to find that CRM114 worked best,
then another month to break through its author's documentation
well enough to control his program from a single Perl program
with no external parameter files.
CRM114 can form features from groups of up to 5 words, taking
combinations of words within each run of 5 consecutive words in documents.
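
To make the 5-word grouping concrete, here is a minimal Python sketch.
It is only an illustration of the idea, not CRM114's actual code: the
function name window_features, the "_" marker for a skipped word, and
the rule of pairing each window's first word with every subset of the
words after it are all my assumptions.

```python
from itertools import combinations


def window_features(text, window=5):
    """Return features built from every run of `window` consecutive words.

    In each window the first word is combined with every subset of the
    words that follow it; a skipped position is written as "_".
    (Illustrative scheme only, not CRM114's real feature generator.)
    """
    words = text.split()
    features = []
    for start in range(len(words)):
        chunk = words[start:start + window]
        head, rest = chunk[0], chunk[1:]
        # k = how many of the following words are kept in this feature
        for k in range(len(rest) + 1):
            for keep in combinations(range(len(rest)), k):
                slots = [rest[i] if i in keep else "_" for i in range(len(rest))]
                features.append(" ".join([head] + slots))
    return features
```

A full 5-word window yields 16 features this way, so the feature count
grows quickly with document length; each feature is then counted by the
classifier just as a single word would be.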
Using 5-word groups produced better results than any filters I used; e.g.,
filtering or altering car manufacturers' standard form prompts like
Fire? Yes_ No_
Initially, I expected correct results of 99% or better,
like my use of Naive Bayes to filter my email.
However, spam email must accomplish some goal (get you to their web page
or show you their low cost), so Naive Bayes approaches work very well on email.
The U.S. Department of Transportation (DOT), defects investigation, contracted
with me to try on their reports what I'd successfully used for email
(others' programs).
They were accumulating 50,000 early warning reports a quarter,
yet their engineers had read only 3,000.
DOT contracted for a dozen people to slog through the accumulated 300,000
reports, identifying those that might portend the necessity of a recall.
But these contractors (probably costing $1 million a year) agreed with
the engineers no more than 50% of the time.
After 2 months, I was able to correctly identify only 30% of reports.
Then I read that Naive Bayes was, after all, "naive".
It presumed independence between words.
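
A minimal Python sketch of what that independence assumption means in
practice: the score for a class is just the class prior times one
factor per word, with no term for how words occur together. The class
name NaiveBayes, its methods, and the add-one smoothing are my
assumptions for illustration; this is not the code of CRM114 or of any
R package.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Word-count Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> documents trained
        self.vocab = set()

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, words):
        """Return log(prior) + sum of per-word log-likelihoods per label."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            # Independence: each word contributes its own factor,
            # regardless of which other words appear next to it.
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)
```

Because the per-word factors are simply added in log space, a phrase
such as "seat caught fire" scores no differently from the same three
words scattered through a report, which is one reason word groups can
help.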
There's an old statistical saying,
"Do you prefer to perfectly solve the wrong problem,
or wrongly solve the correct problem?"
People using Naive Bayes use many heuristics, as the CRM114 documentation
describes:
a. TOE, "Train on Error"
for which you retrain any document that Naive Bayes classifies
incorrectly.
Statistically, this is somewhat like having a learned population
with more than one of the same document.
b. SSTTT, "Single Sided Thick Threshold Training"
for which you retrain a document when it doesn't identify correctly
with a sufficiently high probability.
c. TUNE, "Train Until No Error"
for which you recycle through your known records until you
reach perfection, although I often forced a stop when no improvement
resulted after 12 cycles.
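
The three heuristics above can be sketched as one training loop in
Python. Everything here is illustrative and assumed: the TinyBayes
class, the function names, and the threshold values are mine, not
CRM114's. The 12-cycle limit is also simplified to a hard cap rather
than a "no improvement" check.

```python
import math
from collections import Counter


class TinyBayes:
    """Two-class word-count Naive Bayes, add-one smoothing (illustrative)."""

    def __init__(self, labels=("flag", "ok")):
        self.labels = labels
        self.words = {label: Counter() for label in labels}
        self.docs = Counter()
        self.vocab = set()

    def train(self, words, label):
        self.words[label].update(words)
        self.docs[label] += 1
        self.vocab.update(words)

    def prob(self, words, label):
        """Posterior probability of `label`, assuming independent words."""
        logs = {}
        total_docs = sum(self.docs.values())
        for lab in self.labels:
            denom = sum(self.words[lab].values()) + len(self.vocab) + 1
            logp = math.log((self.docs[lab] + 1) / (total_docs + len(self.labels)))
            logp += sum(math.log((self.words[lab][w] + 1) / denom) for w in words)
            logs[lab] = logp
        top = max(logs.values())  # subtract the max to avoid underflow
        weights = {lab: math.exp(v - top) for lab, v in logs.items()}
        return weights[label] / sum(weights.values())


def train_pass(model, corpus, threshold=0.5):
    """One pass: retrain each document whose true label scores <= threshold.

    threshold=0.5 behaves like Train on Error for two classes; a higher
    value retrains correct-but-unsure documents too (thick threshold).
    Returns the number of documents retrained.
    """
    retrained = 0
    for words, label in corpus:
        if model.prob(words, label) <= threshold:
            model.train(words, label)
            retrained += 1
    return retrained


def train_until_no_error(model, corpus, threshold=0.5, max_cycles=12):
    """Repeat passes until a pass retrains nothing, or max_cycles is hit."""
    for cycle in range(1, max_cycles + 1):
        if train_pass(model, corpus, threshold) == 0:
            return cycle
    return max_cycles
```

Retraining the same document again is what makes the learned
population behave as if it held several copies of that document.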
All these techniques improved correct identification and concentration
(proportion of "flagged" reports that are correctly flagged) to about
Then the engineers (gearheads) did the inexplicable -- they read about
20,000 reports, jumping the correctness of the crm114 Naive Bayes
approach with the above heuristics to about 88%.
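
For anyone wanting to compute the two measures, here is a small Python
sketch. "Concentration" follows the definition given above. Reading
"correct identification" as the share of truly relevant reports that
get flagged is my assumption, as are the function names.

```python
def concentration(flagged, truly_relevant):
    """Proportion of flagged reports that are correctly flagged."""
    flagged, truly_relevant = set(flagged), set(truly_relevant)
    if not flagged:
        return 0.0
    return len(flagged & truly_relevant) / len(flagged)


def correct_identification(flagged, truly_relevant):
    """Proportion of truly relevant reports that were flagged (assumed reading)."""
    flagged, truly_relevant = set(flagged), set(truly_relevant)
    if not truly_relevant:
        return 0.0
    return len(flagged & truly_relevant) / len(truly_relevant)
```

With report IDs 1, 2, 3, 4 flagged and reports 2, 3, 9 truly relevant,
concentration is 2/4 and correct identification is 2/3.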
Suddenly, crm114 Naive Bayes "flagged" reports were fun to read.
For example, a report no-one had yet identified described a fellow's
car modified with airbags to lift the car to a high height
using canisters of some air in the back of his pickup.
Driving down the road, he noticed a warning light flashing on his air
Soon afterwards, the passenger seat caught fire.
Even though his pickup was moving down the road,
the flashing warning light and flaming passenger seat
prompted him to open his driver's door and leap from his moving pickup.
While I worked the Bayesian approach and contractors read reports as two
approaches to slog through 300,000 reports,
big software/contractor companies hovered over the spending and
But their approaches were all judged foolish -- expensively foolish.
So, if you really have a problem worthy of solving well,
some time, and some programming skills,
you can integrate a Naive Bayes procedure with some heuristic
procedures, probably with good correct identification
and a high concentration of correctly "flagged"
documents among Bayes flagged documents.
On Mon, Jul 24, 2006 at 06:59:31PM +0100, Eleni Rapsomaniki wrote:
> This is a question regarding classification performance using different methods.
> So far I've tried NaiveBayes (klaR package), svm (e1071) package and
> randomForest (randomForest). What has puzzled me is that randomForest seems to
> perform far better (32% classification error) than svm and NaiveBayes, which
> have similar classification errors (45%, 48% respectively). A similar
> difference in performance is observed with different combinations of
> parameters, priors and size of training data.
> Because I was expecting to see little difference in the performance of these
> methods I am worried that I may have made a mistake in my randomForest call:
> my.rf=randomForest(x=train.df[,-response_index], y=train.df[,response_index],
> xtest=test.df[,-response_index], ytest=test.df[,response_index],
> importance=TRUE,proximity=FALSE, keep.forest=FALSE)
> (where train.df and test.df are my train and test data.frames and response_index
> is the column number specifying the class)
> My main question is: could there be a legitimate reason why random forest would
> outperform the other two models (e.g. maybe one
> method is more reliable with Gaussian data, handles categorical data
> better etc)? Also, is there a way of evaluating the predictive ability of each
> parameter in the bayesian model as it can be done for random Forests (through
> the importance table)?
> I would appreciate any of your comments and suggestions on these.
> Many thanks
> Eleni Rapsomaniki
Jameson C. Burt, NJ9L Fairfax, Virginia, USA
jameson at coost.com http://www.coost.com
(202) 690-0380 (work)
LTSP.org: magic "mysterious and awe-inspiring even though
we know they are real and not supernatural"