[R] rpart minimum sample size

Terry Therneau therneau at mayo.edu
Wed Feb 28 15:59:44 CET 2007


  Look at rpart.control.  Rpart has two "advisory" parameters that control
the tree size at the smallest nodes:
	minsplit (default 20): a node with fewer than this many subjects is
	not considered worth splitting
	
	minbucket (default 7, i.e. round(minsplit/3)): don't create any final
	node with fewer than this many observations
	
As I said, these are advisory, and reflect the fact that such final splits are
usually not worthwhile.  They give a slightly faster run time, but mostly a less
complex plotted model.
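
A minimal sketch of how these two parameters are passed, assuming the kyphosis
data set that ships with rpart; the formula and the looser values (10 and 3)
are chosen only for illustration, and leaf_n is a hypothetical helper, not part
of rpart:

```r
library(rpart)  # the kyphosis data set is included with rpart

# Defaults: minsplit = 20, minbucket = round(minsplit / 3) = 7
fit_default <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Looser limits (illustrative values) allow smaller nodes to be split/created
fit_small <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   control = rpart.control(minsplit = 10, minbucket = 3))

# Hypothetical helper: number of observations in each terminal node
leaf_n <- function(fit) fit$frame$n[fit$frame$var == "<leaf>"]

min(leaf_n(fit_default))  # smallest leaf under the defaults
min(leaf_n(fit_small))    # smallest leaf under the looser limits
```

Comparing the two minima shows whether the looser settings actually produced
smaller final nodes on this data set; printing or plotting both fits shows the
difference in tree complexity.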

  I am not nearly as pessimistic as Frank Harrell ("need 20,000 observations").
Rpart often gives a good model, one that predicts the outcome well, and I find
the intermediate steps that it takes informative.  However, there are often many
trees with similar predictive ability but a very different "look" in terms
of split points and variables.  Claiming that any given rpart model is THE best
is perilous.
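
One rough way to see how much the "look" of a tree can vary is to refit on
bootstrap resamples and record the variable chosen at the root.  This is only
a sketch under stated assumptions: the kyphosis data, the formula, the seed,
and the 20 replicates are all arbitrary illustrative choices, and it examines
only the first split, not predictive ability.

```r
library(rpart)  # the kyphosis data set is included with rpart

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

root_var <- replicate(20, {
  idx <- sample(nrow(kyphosis), replace = TRUE)   # bootstrap resample of rows
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[idx, ])
  as.character(fit$frame$var[1])                  # root split variable, or "<leaf>" if no split
})

table(root_var)  # how often each variable was chosen for the first split
```

If the table shows more than one variable, the resampled trees already differ
at the very first split.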
	Terry T.


