[R] rpart

Tue Sep 26 13:54:22 CEST 2006

On Tue, 26 Sep 2006, henrigel at gmx.de wrote:

>
> -------- Original Message --------
> Date: Tue, 26 Sep 2006 09:56:53 +0100 (BST)
> From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
> To: henrigel at gmx.de
> Subject: Re: [R] rpart
>
>> On Mon, 25 Sep 2006, henrigel at gmx.de wrote:
>>
>>> Dear r-help-list:
>>>
>>> If I use the rpart method like
>>>
>>> cfit<-rpart(y~.,data=data,...),
>>>
>>> what kind of tree is stored in cfit?
>>> Is it right that this tree is not pruned at all, that it is the full
>>> tree?
>>
>> It is an rpart object.  This contains both the tree and the instructions
>> for pruning it at all values of cp: note that cp is also used in deciding
>> how large a tree to grow.
>>
>
> OK, I have to explain my problem in a little more detail; I'm sorry for being so vague.
> I used the method in the following way:
> cfit<- rpart(y~., data=data, method="class", minsplit=1, cp=0)
> I got a tree with a lot of terminal nodes that contained more than 100 observations. This made me believe that the tree was already pruned.
> On the other hand, the printcp method showed subtrees that were "better".
> This made me believe that the tree hadn't been pruned before.
> So, are the trees "a little bit" pruned?

Yes, as you asked for cp=0.  Look up what that does in ?rpart.control.
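
For reference, ?rpart.control describes cp as a threshold: any split that does not decrease the overall lack of fit by a factor of cp is not attempted. A minimal sketch of growing a large tree and then inspecting and pruning it follows. It assumes the kyphosis data shipped with rpart as a stand-in for the poster's data, sets minbucket = 1 explicitly, and uses an arbitrary pruning value of 0.05; none of these come from the thread.

```r
library(rpart)

set.seed(1)  # the xerror column comes from cross-validation, so fix the seed

# Grow a deliberately large classification tree.
# Assumption: kyphosis stands in for the poster's data; minbucket = 1 is
# set explicitly for this sketch rather than derived from minsplit.
cfit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
              control = rpart.control(minsplit = 1, minbucket = 1, cp = 0))

# The fitted object carries the whole pruning sequence in its cp table.
printcp(cfit)
# plotcp(cfit) draws the same information as a plot.

# Cut the tree back to the subtree for a chosen cp (0.05 is arbitrary here).
pruned <- prune(cfit, cp = 0.05)
```

The pruned object is again an rpart object, so printcp and plotcp can be applied to it in the same way.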

>>> If so, it's up to me to choose a subtree by using the printcp method.
>>
>> Or the plotcp method.
>>
>>> In the technical report from Atkinson and Therneau "An Introduction to
>>> recursive partitioning using the rpart routines" from 2000, one can see
>>> the following table on page 15:
>>>
>>>     CP      nsplit  rel error  xerror   xstd
>>> 1   0.105   0       1.00000    1.0000   0.108
>>> 2   0.056   3       0.68519    1.1852   0.111
>>> 3   0.028   4       0.62963    1.0556   0.109
>>> 4   0.574   6       0.57407    1.0556   0.109
>>> 5   0.100   7       0.55556    1.0556   0.109
>>>
>>> Some lines below it says "We see that the best tree has 5 terminal nodes
>>> (4 splits)." Why is that, if the xerror is lowest for the tree
>>> consisting only of the root?
>>
>> There are *two* reports with that name: this seems to be from minitech.ps.
>> The choice is explained in the rest of that para (the 1-SE rule was used).
>> My guess is that the authors excluded the root as not being a tree, but
>> only they can answer that.
>>
>
> Are both reports from 2000? But you're right, I'm talking about the one from minitech.ps.
> The 1-SE rule only explains why they didn't choose the tree with 6 or 7 splits, but not why they didn't choose the "tree" without a split.
> The exclusion of the root as not being a tree was my first explanation, too. But if the tree only consisting of the root is still better than any other tree, why would I choose a tree with 4 splits then?
>
> Henri
>
>
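
The arithmetic of the 1-SE rule on the table quoted above can be sketched as follows. The helper pick_1se is hypothetical (it is not part of rpart), and it assumes the rule means "the smallest tree whose xerror is within one xstd of the minimum xerror". It only reproduces the computation on the quoted numbers and does not settle what the report's authors actually did.

```r
# Columns of the quoted table that the rule needs.
tab <- data.frame(
  nsplit = c(0, 3, 4, 6, 7),
  xerror = c(1.0000, 1.1852, 1.0556, 1.0556, 1.0556),
  xstd   = c(0.108, 0.111, 0.109, 0.109, 0.109)
)

# Hypothetical helper: smallest tree whose cross-validated error lies
# within one standard error of the minimum cross-validated error.
pick_1se <- function(tab) {
  best <- which.min(tab$xerror)                    # row with the lowest xerror
  threshold <- tab$xerror[best] + tab$xstd[best]   # minimum plus one SE
  ok <- tab[tab$xerror <= threshold, ]             # candidates within one SE
  ok[which.min(ok$nsplit), ]                       # the smallest of them
}

all_rows <- pick_1se(tab)                   # root included
no_root  <- pick_1se(tab[tab$nsplit > 0, ]) # root excluded

all_rows$nsplit  # 0
no_root$nsplit   # 4
```

With the root included the rule returns the root itself; with the root excluded it returns the tree with 4 splits, which agrees with the guess that the root was left out of the comparison.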

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595