[R] rpart minimum sample size
Terry Therneau
therneau at mayo.edu
Wed Feb 28 15:59:44 CET 2007
Look at rpart.control. Rpart has two "advisory" parameters that control
the tree size at the smallest nodes:
  minsplit (default 20): a node with fewer than this many subjects is
     not considered worth splitting
  minbucket (default 7): don't create any final nodes with fewer than
     this many observations
As I said, these are advisory, and reflect that these final splits are usually
not worthwhile. They lead to a slightly faster run time, but mostly to a less
complex plotted model.
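
A minimal sketch of setting these two parameters, assuming the kyphosis data
frame that ships with rpart (the data set, the formula, and the helper name
leaf_sizes are illustrative choices, not part of the original question). When
only minsplit is given, rpart.control sets minbucket to round(minsplit/3),
which is where the default of 7 comes from.

```r
library(rpart)

# Default controls: minsplit = 20, minbucket = round(20 / 3) = 7
fit_default <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                     method = "class")

# Looser controls: allow splits of smaller nodes and smaller final nodes
fit_small <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minsplit = 10, minbucket = 3))

# Hypothetical helper: number of observations in each terminal node
leaf_sizes <- function(fit) fit$frame$n[fit$frame$var == "<leaf>"]

leaf_sizes(fit_default)
leaf_sizes(fit_small)
```

Comparing the two sets of leaf sizes shows how far down the tree is allowed
to go under each setting; whether the extra splits are worth keeping is a
separate question.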
I am not nearly as pessimistic as Frank Harrell ("need 20,000 observations").
Rpart often gives a good model -- one that predicts the outcome, and I find
the intermediate steps that it takes informative. However, there are often many
trees with similar predictive ability but a very different "look" in terms
of split points and variables. Saying that any given rpart model is THE best
is perilous.
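
One rough way to see this instability, sketched under the assumption that
refitting on bootstrap resamples of the kyphosis data is a fair stand-in for
"many similar trees" (the data set, the seed, and the choice of 20 resamples
are arbitrary illustrations):

```r
library(rpart)

set.seed(1)  # arbitrary seed, only so the resamples are reproducible

# Refit the tree on 20 bootstrap resamples and record which variable
# is used for the first (root) split each time
root_vars <- replicate(20, {
  idx <- sample(nrow(kyphosis), replace = TRUE)
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[idx, ],
               method = "class")
  as.character(fit$frame$var[1])  # "<leaf>" if no split was made
})

table(root_vars)
```

If the table shows more than one variable at the root, the trees differ in
"look" from the very first split, even though they come from nearly the same
data.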
Terry T.