[R] Recursive partitioning algorithms in R vs. alia
Terry Therneau
therneau at mayo.edu
Tue Jun 23 15:21:37 CEST 2009
A point of history:
Both the commercial CART program and the rpart() function are based on the
book Classification and Regression Trees (Breiman, Friedman, Olshen, Stone,
1984). As a reader/commentator on one of the early drafts I got to know the
material well. CART started as a large Fortran program written by Jerry
Friedman which was the testing ground for the ideas in the book. I had the code
at one time and made some modifications to it, but found it too frustrating to
go very far with. Fortran is just too clumsy for a recursive task, and Jerry's
ability to hold upteen variables in his head at once greater than mine -- the
Fortran was a large monlithic block. Salford Systems aquired rights to that
code; I don't know whether any of the original lines remain in their product. I
had lots of conversations with their main programmer (15-20 years ago now) about
methods for speeding it up; mainly an interesting problem in optimal indexing.
When rpart was first written it's output agreed with CART almost entirely.
The only major difference was in surrogates: I pick the surrogate with the
largest number of agreements, CART picked that with the greatest % agreement.
This means that rpart favors variables with fewer missing values. Since that
point in time both codes have evolved. I haven't had time to do important work
on rpart in over a decade. It' not surprising that the graphics and display are
behind the curve, what's more surprising is that it still endures.
Rpart is called "rpart" because the authors copyrighted the term "CART" for
their program. It was the best alternative name that I could come up with at
the time. I find it amusing that one consequence of their copyright choice is
that I now see "recursive partitioning" far more often than "CART" as the
generic label for tree based methods.
Terry T
More information about the R-help
mailing list