[R] rpart
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Sep 26 13:54:22 CEST 2006
On Tue, 26 Sep 2006, henrigel at gmx.de wrote:
>
> -------- Original Message --------
> Date: Tue, 26 Sep 2006 09:56:53 +0100 (BST)
> From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
> To: henrigel at gmx.de
> Subject: Re: [R] rpart
>
>> On Mon, 25 Sep 2006, henrigel at gmx.de wrote:
>>
>>> Dear r-help-list:
>>>
>>> If I use the rpart method like
>>>
>>> cfit<-rpart(y~.,data=data,...),
>>>
>>> what kind of tree is stored in cfit?
>>> Is it right that this tree is not pruned at all, that it is the full
>>> tree?
>>
>> It is an rpart object. This contains both the tree and the instructions
>> for pruning it at all values of cp: note that cp is also used in deciding
>> how large a tree to grow.
>>
>
> OK, I have to explain my problem in a little more detail; I'm sorry for being so vague.
> I used the method in the following way:
> cfit<- rpart(y~., method="class", minsplit=1, cp=0)
> I got a tree with a lot of terminal nodes that contained more than 100 observations. This made me believe that the tree had already been pruned.
> On the other hand, the printcp method showed subtrees that were "better".
> This made me believe that the tree hadn't been pruned before.
> So, are the trees "a little bit" pruned?
Yes, as you asked for cp=0. Look up what that does in ?rpart.control.
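A minimal sketch of how this can be inspected, assuming the kyphosis data that ships with rpart as a stand-in for your data. The names fit, cp_table, best_cp and pruned are made up for illustration, and minsplit = 2 is used here rather than your minsplit = 1. Per ?rpart.control, any split that does not decrease the overall lack of fit by a factor of cp is not attempted, so cp limits how far the tree is grown, and the cp table stored in the fitted object lists the nested subtrees it can be pruned back to:

```r
library(rpart)

set.seed(1)  # the xerror column comes from random cross-validation folds

# Grow a large classification tree (kyphosis is an assumption, not your data).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = 2, cp = 0))

# The same table that printcp() prints is stored in the rpart object.
cp_table <- fit$cptable
print(cp_table)

# One possible choice: the cp value of the row with the smallest xerror.
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# prune() returns the subtree corresponding to that cp value.
pruned <- prune(fit, cp = best_cp)
```

plotcp(fit) shows the same cross-validation results graphically, which is often easier to read than the table.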
>>> If so, it's up to me to choose a subtree by using the printcp method.
>>
>> Or the plotcp method.
>>
>>> In the technical report from Atkinson and Therneau "An Introduction to
>>> recursive partitioning using the rpart routines" from 2000, one can see
>>> the following table on page 15:
>>>
>>>      CP nsplit rel error xerror  xstd
>>> 1 0.105      0   1.00000 1.0000 0.108
>>> 2 0.056      3   0.68519 1.1852 0.111
>>> 3 0.028      4   0.62963 1.0556 0.109
>>> 4 0.574      6   0.57407 1.0556 0.109
>>> 5 0.100      7   0.55556 1.0556 0.109
>>>
>>> Some lines below it says "We see that the best tree has 5 terminal nodes
>>> (4 splits)". Why is that, if the xerror is lowest for the tree
>>> consisting only of the root?
>>
>> There are *two* reports with that name: this seems to be from minitech.ps.
>> The choice is explained in the rest of that para (the 1-SE rule was used).
>> My guess is that the authors excluded the root as not being a tree, but
>> only they can answer that.
>>
>
> Are both reports from 2000? But you're right, I'm talking about the one from minitech.ps.
> The 1-SE rule only explains why they didn't choose the tree with 6 or 7 splits, but not why they didn't choose the "tree" without a split.
> The exclusion of the root as not being a tree was my first explanation, too. But if the tree consisting only of the root is still better than any other tree, why would I choose a tree with 4 splits?
>
> Henri
>
>
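For what it is worth, the arithmetic of the 1-SE rule on the table quoted above can be sketched as follows. This is only an illustration: one_se_choice is a hypothetical helper, not an rpart function, the rule is taken here to mean "the smallest tree whose xerror is within one xstd of the minimum xerror", and the numbers are copied from the table as quoted. Whether the authors really excluded the root remains a guess.

```r
# Cross-validation columns of the quoted table (values copied as quoted).
cp_table <- data.frame(
  nsplit = c(0, 3, 4, 6, 7),
  xerror = c(1.0000, 1.1852, 1.0556, 1.0556, 1.0556),
  xstd   = c(0.108, 0.111, 0.109, 0.109, 0.109)
)

# Hypothetical helper: return nsplit of the smallest tree whose xerror is
# at most (minimum xerror + the xstd of that minimum row).
one_se_choice <- function(tab) {
  best <- which.min(tab$xerror)
  threshold <- tab$xerror[best] + tab$xstd[best]
  candidates <- tab[tab$xerror <= threshold, ]
  min(candidates$nsplit)
}

# Applied to all rows, including the root (nsplit = 0).
with_root <- one_se_choice(cp_table)

# Applied after dropping the root, i.e. assuming it does not count as a tree.
without_root <- one_se_choice(cp_table[cp_table$nsplit > 0, ])

print(c(with_root = with_root, without_root = without_root))
```

With the root included the rule picks the root itself; with the root dropped, the threshold becomes 1.0556 + 0.109 = 1.1646, the 3-split tree (xerror 1.1852) falls outside it, and the smallest remaining tree is the one with 4 splits.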
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595