[R] rpart

Tue Sep 26 17:51:11 CEST 2006

-------- Original Message --------
Date: Tue, 26 Sep 2006 12:54:22 +0100 (BST)
From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
To: henrigel at gmx.de
Subject: Re: [R] rpart

> On Tue, 26 Sep 2006, henrigel at gmx.de wrote:
> 
> >
> > -------- Original Message --------
> > Date: Tue, 26 Sep 2006 09:56:53 +0100 (BST)
> > From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
> > To: henrigel at gmx.de
> > Subject: Re: [R] rpart
> >
> >> On Mon, 25 Sep 2006, henrigel at gmx.de wrote:
> >>
> >>> Dear r-help-list:
> >>>
> >>> If I use the rpart method like
> >>>
> >>> cfit<-rpart(y~.,data=data,...),
> >>>
> >>> what kind of tree is stored in cfit?
> >>> Is it right that this tree is not pruned at all, that it is the full
> >>> tree?
> >>
> >> It is an rpart object.  This contains both the tree and the instructions
> >> for pruning it at all values of cp: note that cp is also used in deciding
> >> how large a tree to grow.
> >>
> >
> > OK, I have to explain my problem in a little more detail; I'm sorry for
> > being so vague.
> > I used the method in the following way:
> > cfit <- rpart(y~., data=data, method="class", minsplit=1, cp=0)
> > I got a tree with a lot of terminal nodes that contained more than 100
> > observations. This made me believe that the tree was already pruned.
> > On the other hand, the printcp method showed subtrees that were "better".
> > This made me believe that the tree hadn't been pruned before.
> > So, are the trees "a little bit" pruned?
> 
> Yes, as you asked for cp=0.  Look up what that does in ?rpart.control.
> 

I thought I would get a full tree by choosing cp=0, and it was one.
The nodes with more than 100 observations were not split further because
no sequence of splits changed the class label for any subset of them.
(A bad explanation, but you probably know what I mean.) I realized that
when I chose cp=-1. Thank you very much for your help!
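
To make the cp=0 versus cp=-1 observation concrete, here is a minimal sketch.
It assumes the kyphosis data set that ships with rpart as a stand-in for the
original data (which is not shown in this thread), uses minsplit=2 rather than
minsplit=1, and turns cross-validation off (xval=0) so the result does not
depend on random folds. The helper n_leaves is a hypothetical name introduced
here for illustration.

```r
library(rpart)

# Assumption: kyphosis (bundled with rpart) stands in for the poster's data.
# xval = 0 skips cross-validation, so both fits are fully deterministic.
ctrl_zero <- rpart.control(minsplit = 2, cp = 0,  xval = 0)
ctrl_neg  <- rpart.control(minsplit = 2, cp = -1, xval = 0)

fit_zero <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                  control = ctrl_zero)
fit_neg  <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                  control = ctrl_neg)

# Hypothetical helper: count terminal nodes, marked "<leaf>" in fit$frame$var.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# With cp = 0, splits that do not reduce the misclassification risk are
# dropped; with cp = -1 they are kept, so the second tree is at least as large.
c(cp_zero = n_leaves(fit_zero), cp_negative = n_leaves(fit_neg))
```

If the two counts differ, the extra leaves under cp=-1 come from splits that
leave the predicted class unchanged, which is the situation described above.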

> >>> If so, it's up to me to choose a subtree by using the printcp method.
> >>
> >> Or the plotcp method.
> >>
> >>> In the technical report from Atkinson and Therneau "An Introduction to
> >>> recursive partitioning using the rpart routines" from 2000, one can see
> >>> the following table on page 15:
> >>>
> >>>      CP  nsplit  relerror  xerror   xstd
> >>> 1   0.105   0     1.00000   1.0000   0.108
> >>> 2   0.056   3     0.68519   1.1852   0.111
> >>> 3   0.028   4     0.62963   1.0556   0.109
> >>> 4   0.574   6     0.57407   1.0556   0.109
> >>> 5   0.100   7     0.55556   1.0556   0.109
> >>>
> >>> Some lines below it says "We see that the best tree has 5 terminal
> >>> nodes (4 splits)." Why is that, if the xerror is lowest for the tree
> >>> consisting only of the root?
> >>
> >> There are *two* reports with that name: this seems to be from minitech.ps.
> >> The choice is explained in the rest of that para (the 1-SE rule was used).
> >> My guess is that the authors excluded the root as not being a tree, but
> >> only they can answer that.
> >>
> >
> > Are both reports from 2000? But you're right, I'm talking about the one
> > from minitech.ps.
> > The 1-SE rule only explains why they didn't choose the tree with 6 or 7
> > splits, but not why they didn't choose the "tree" without a split.
> > The exclusion of the root as not being a tree was my first explanation,
> > too. But if the tree consisting only of the root is still better than any
> > other tree, why would I choose a tree with 4 splits then?
> >
> > 
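
The 1-SE question above can be checked by hand. The sketch below is one
reading of the rule (pick the smallest tree whose xerror is within one xstd of
the minimum xerror), applied to the nsplit, xerror and xstd columns exactly as
quoted in the table. The function name one_se_pick is hypothetical, and
excluding the root is only the guess made in this thread, not something the
report's authors have confirmed.

```r
# nsplit, xerror and xstd copied from the table quoted above.
tab <- data.frame(
  nsplit = c(0, 3, 4, 6, 7),
  xerror = c(1.0000, 1.1852, 1.0556, 1.0556, 1.0556),
  xstd   = c(0.108, 0.111, 0.109, 0.109, 0.109)
)

# Hypothetical helper: rows must be ordered from smallest to largest tree.
one_se_pick <- function(tab) {
  best      <- which.min(tab$xerror)               # row with the lowest xerror
  threshold <- tab$xerror[best] + tab$xstd[best]   # minimum plus one SE
  tab$nsplit[min(which(tab$xerror <= threshold))]  # smallest tree under it
}

with_root    <- one_se_pick(tab)                   # root counted as a candidate
without_root <- one_se_pick(tab[tab$nsplit > 0, ]) # root excluded (a guess)
c(with_root = with_root, without_root = without_root)
```

Under these assumptions the rule picks the root (0 splits) when the root is
allowed, and the tree with 4 splits only once the root is excluded, which is
consistent with the guess about how the report arrived at 4 splits.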

Henri
