[R] (no subject)

Fri Apr 2 16:25:58 CEST 2010

> I'm using rpart function for creating regression trees.
> now how to measure the fitness of regression tree???
> 
> thanks n Regards,
> Vibha

I read R-help as a digest, so I often come late to a discussion.  Let me
start by being the first to directly answer the question:

> library(rpart)
> library(survival)   # supplies the lung data set
> fit <- rpart(time ~ age + ph.ecog, data = lung)
> summary(fit)
Call:
rpart(formula = time ~ age + ph.ecog, data = lung)
  n= 228 

          CP nsplit rel error   xerror      xstd
1 0.03516666      0 1.0000000 1.009949 0.1137819
2 0.01459053      1 0.9648333 1.049636 0.1282259
3 0.01324335      3 0.9356523 1.090562 0.1301632
4 0.01000000      7 0.8810284 1.063609 0.1298557

Node number 1: 228 observations,    complexity param=0.03516666
  mean=305.2325, MSE=44176.93 
  left son=2 (51 obs) right son=3 (177 obs)
  Primary splits:
...

The relative error and cross-validated relative error columns above
(rel error and xerror) are, for a regression tree, equal to 1-R^2.  In
this case none of the splits are useful: the cross-validated error never
falls below its value for the root node, and even the naive
non-cross-validated improvement for the first split isn't much
(R^2 < .04).
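
To read the same information off the fitted object rather than the
printout, something like the following should work.  It is only a
sketch: it assumes the lung data come from the survival package, and the
names r2, apparent_R2 and xval_R2 are my own labels, not anything rpart
defines.  The xerror column comes from random cross-validation folds, so
those values will differ a little from run to run unless a seed is set.

```r
library(rpart)
library(survival)   # assumed source of the lung data set

set.seed(1)         # xerror depends on the random cross-validation folds
fit <- rpart(time ~ age + ph.ecog, data = lung)

# cptable has one row per candidate tree size
cp <- fit$cptable

# For a regression tree, both error columns are 1 - R^2
r2 <- data.frame(
  nsplit      = cp[, "nsplit"],
  apparent_R2 = 1 - cp[, "rel error"],  # optimistic, same data used twice
  xval_R2     = 1 - cp[, "xerror"]      # cross-validated, can be negative
)
print(round(r2, 4))
```

A negative xval_R2 simply means that the tree predicts worse, under
cross-validation, than the overall mean does.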

Now to the larger debate.  I do not find trees as useless as Frank (does
anyone?).  I like to use them for initial data exploration, in the same
fashion as a scatterplot.  But I fight the same battle that he does with
some colleagues and customers: they are so very easy to interpret that
the results are often severely over-interpreted, sometimes to the point
that the tree did more harm than good.

All forward stepwise procedures are unstable.  Particularly with rich
data sets, such as I see each day in the medical field, there are
multiple overlapping/correlated predictors.  Small changes in the data
will completely change the order of a forward stepwise regression.
Anyone who puts faith in the ORDER of inclusion as a measure of worth is
like a flag in a fitful breeze.
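
One rough way to see this instability with a tree is to refit on
bootstrap resamples and record which variable is chosen for the first
split.  The sketch below is only an illustration: the choice of the lung
data (assumed to come from the survival package), the predictor set, the
number of resamples and the name first_split are all my own assumptions.

```r
library(rpart)
library(survival)   # assumed source of the lung data set

set.seed(42)
n_boot <- 100       # arbitrary number of bootstrap resamples

first_split <- replicate(n_boot, {
  idx <- sample(nrow(lung), replace = TRUE)   # resample rows
  fit <- rpart(time ~ age + ph.ecog + ph.karno + pat.karno + wt.loss,
               data = lung[idx, ])
  # The first row of the frame is the root node; its var is the
  # splitting variable, or "<leaf>" if no split was made at all
  as.character(fit$frame$var[1])
})

# How often each variable was picked for the first split
print(table(first_split))
```

If the first split changes from one resample to the next, that is the
same lack of stability in the order of inclusion described above.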

A bigger problem with rpart is that users consistently ignore the xerror
column above, and print out (and believe) bigger trees than they should.
Once the xerror bottoms out you are almost certainly looking at random
noise.  Since the xerror curve often has a long flat bottom, the 1SE rule
is better than taking the minimum outright (anything within 1SE of the
min is a tie, use the smallest of a set of tied models).
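
The 1SE rule can be applied to the cptable along the following lines.
This is a minimal sketch rather than a recommended recipe: the names
best, threshold, chosen and pruned are my own, and the lung data are
assumed to come from the survival package.

```r
library(rpart)
library(survival)   # assumed source of the lung data set

set.seed(1)         # xerror and xstd depend on the random folds
fit <- rpart(time ~ age + ph.ecog, data = lung)
cp  <- fit$cptable

# Row with the smallest cross-validated error
best <- which.min(cp[, "xerror"])

# Anything within one standard error of that minimum counts as a tie
threshold <- cp[best, "xerror"] + cp[best, "xstd"]

# Rows are ordered from smallest tree to largest, so the first row
# under the threshold is the smallest of the tied models
chosen <- min(which(cp[, "xerror"] <= threshold))

# Prune back to the chosen size
pruned <- prune(fit, cp = cp[chosen, "CP"])
print(pruned)
```

For the fit shown at the top of this message the rule would pick the
root-only tree, since no split brings xerror below its starting value.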

Terry Therneau