[R] rpart

Tue Sep 26 13:34:44 CEST 2006

-------- Original Message --------
Date: Tue, 26 Sep 2006 09:56:53 +0100 (BST)
From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
To: henrigel at gmx.de
Subject: Re: [R] rpart

> On Mon, 25 Sep 2006, henrigel at gmx.de wrote:
> 
> > Dear r-help-list:
> >
> > If I use the rpart method like
> >
> > cfit<-rpart(y~.,data=data,...),
> >
> > what kind of tree is stored in cfit?
> > Is it right that this tree is not pruned at all, that it is the full
> > tree?
> 
> It is an rpart object.  This contains both the tree and the instructions 
> for pruning it at all values of cp: note that cp is also used in deciding 
> how large a tree to grow.
> 

OK, I need to explain my problem in more detail; I'm sorry for being so vague.
I called rpart as follows:
cfit <- rpart(y~., data=data, method="class", minsplit=1, cp=0)
I got a tree with many terminal nodes that each contained more than 100 observations. This made me believe that the tree had already been pruned.
On the other hand, printcp showed subtrees that were "better".
This made me believe that the tree had not been pruned.
So, are the trees pruned "a little bit"?
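
For reference, a minimal sketch of how one might inspect the subtree sequence stored in an rpart object and prune it afterwards. It uses the kyphosis data that ships with rpart as a stand-in for the data in question, and minsplit=2 rather than minsplit=1; both are assumptions made for illustration, as is the choice to prune at the row with the smallest xerror.

```r
library(rpart)  # recommended package that ships with R

# kyphosis is only a stand-in for the poster's data (assumption)
set.seed(1)     # the cross-validation folds are random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = 2, cp = 0))

# One row per nested subtree: CP, nsplit, rel error, xerror, xstd
printcp(fit)

# Pick the CP value of the row with the smallest cross-validated error
cptab   <- fit$cptable
best_cp <- cptab[which.min(cptab[, "xerror"]), "CP"]

# prune() returns the subtree for that CP; fit itself is left unchanged
pruned <- prune(fit, cp = best_cp)
```

The object returned by rpart keeps the tree as grown under the given control settings, and prune() derives a smaller subtree from it on request.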

> > If so, it's up to me to choose a subtree by using the printcp method.
> 
> Or the plotcp method.
> 
> > In the technical report from Atkinson and Therneau "An Introduction to 
> > recursive partitioning using the rpart routines" from 2000, one can see 
> > the following table on page 15:
> >
> >       CP  nsplit  rel error  xerror   xstd
> > 1  0.105       0    1.00000  1.0000  0.108
> > 2  0.056       3    0.68519  1.1852  0.111
> > 3  0.028       4    0.62963  1.0556  0.109
> > 4  0.574       6    0.57407  1.0556  0.109
> > 5  0.100       7    0.55556  1.0556  0.109
> >
> > Some lines below it says "We see that the best tree has 5 terminal nodes
> > (4 splits)". Why is that, if the xerror is lowest for the tree 
> > consisting only of the root?
> 
> There are *two* reports with that name: this seems to be from minitech.ps.
> The choice is explained in the rest of that para (the 1-SE rule was used).
> My guess is that the authors excluded the root as not being a tree, but 
> only they can answer that.
> 

Are both reports from 2000? But you're right, I'm talking about the one from minitech.ps.
The 1-SE rule only explains why they didn't choose the tree with 6 or 7 splits, not why they didn't choose the "tree" without a split.
Excluding the root as not being a tree was my first explanation, too. But if the tree consisting only of the root is still better than any other tree, why would I choose a tree with 4 splits?
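
To make the arithmetic concrete, here is a small sketch that applies the 1-SE rule to the xerror and xstd values from the table quoted above, once with the root as a candidate and once without it. The helper one_se() and the data frame cp_tab are hypothetical names introduced for illustration; the rule is taken here to mean "the smallest tree whose xerror is within one xstd of the minimum xerror", which is an assumption about how the report applied it.

```r
# Values copied from the quoted table (only nsplit, xerror, xstd are needed)
cp_tab <- data.frame(
  nsplit = c(0, 3, 4, 6, 7),
  xerror = c(1.0000, 1.1852, 1.0556, 1.0556, 1.0556),
  xstd   = c(0.108, 0.111, 0.109, 0.109, 0.109)
)

# 1-SE rule: smallest tree whose xerror is within one xstd of the minimum.
# Rows are assumed to be ordered from the smallest tree to the largest.
one_se <- function(tab) {
  i_min     <- which.min(tab$xerror)
  threshold <- tab$xerror[i_min] + tab$xstd[i_min]
  tab$nsplit[which(tab$xerror <= threshold)[1]]
}

with_root    <- one_se(cp_tab)        # root allowed as a candidate
without_root <- one_se(cp_tab[-1, ])  # root excluded

with_root     # 0
without_root  # 4
```

With the root allowed, the rule selects the root itself (0 splits). With the root excluded, the threshold is 1.0556 + 0.109 = 1.1646, the 3-split tree (xerror 1.1852) falls outside it, and the 4-split tree is the smallest one inside it. That is consistent with the guess that the root was left out, but it does not show that this is what the authors actually did.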

Henri
