[R] rpart problem

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Sep 6 22:50:46 CEST 2004


I think you are confused about the purpose of rpart, which is prediction.
You want to predict `mysuccess'.

One group has 90% success, so the best prediction is `success'.
The other group has 60% success, so the best prediction is `success'.

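Both groups get the same prediction, so splitting cannot reduce the
misclassification rate and there is no point in splitting into groups.
Replace 60% by 30% and the best prediction for group 2 changes to
`failure', so a split becomes worthwhile.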
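
To make the arithmetic concrete, here is a minimal sketch in R that counts
misclassified cases with and without the split. It assumes idealised counts
(exactly 900 successes out of 1000 in group 1, and 600 or 300 in group 2)
rather than simulated data, and the helper names node_errors and errors_for
are made up for illustration; they are not part of rpart.

```r
# Cases misclassified when a node predicts its majority class.
# (Hypothetical helper for illustration; not an rpart function.)
node_errors <- function(class_counts) sum(class_counts) - max(class_counts)

# Compare one root node against one node per group, for a given number of
# successes in group 2. Counts are idealised assumptions, 1000 cases per group.
errors_for <- function(successes2) {
  counts <- matrix(c(900, 100,                       # group 1: 90% success
                     successes2, 1000 - successes2), # group 2
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("1", "2"),
                                   outcome = c("success", "failure")))
  c(root  = node_errors(colSums(counts)),            # no split
    split = sum(apply(counts, 1, node_errors)))      # split by group
}

print(errors_for(600))  # group 2 at 60% success
print(errors_for(300))  # group 2 at 30% success
```

With 600 successes in group 2 both figures are 500, so the split gains
nothing for prediction; with 300 they are 800 and 400.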

If this is not now obvious, please read up on tree-based methods.

On Mon, 6 Sep 2004 pfm401 at lineone.net wrote:

> Dear all,
> 
> I am having some trouble with getting the rpart function to work as expected.
> I am trying to use rpart to combine levels of a factor to reduce the number
> of levels of that factor. In exploring the code I have noticed that it is
> possible for chisq.test to return a statistically significant result whilst
> the rpart method returns only the root node (i.e. no split is made). The
> following code recreates the issue using simulated data :
> 
> 
> # Create a 2 level factor with group 1 probability of success 90% and
> # group 2 60%
> tmp1  <- as.factor((runif (1000) <= 0.9))
> tmp2  <- as.factor((runif (1000) <= 0.5))

Is 0.5 a typo for 0.6?

> mysuccess <- as.factor(c(tmp1, tmp2)) 
> mygroup   <- as.factor(c(rep (1,1000), rep (2,1000)))
> 
> table (mysuccess, mygroup)
> chisq.test (mysuccess, mygroup)
> # p-value = < 2.2e-16
> 
> library(rpart)
> myrpart <- rpart (mysuccess ~ mygroup)
> myrpart
> # rpart does not provide splits !!
> 
> 
> 
> If I change the parameter in the setting of group 2 to 0.3 from 0.6 rpart
> does return splits, i.e. change the line 
> 
> tmp2  <- as.factor((runif (1000) <= 0.6))
> 
> to 
> 
> tmp2  <- as.factor((runif (1000) <= 0.3))
> 
> rpart does split the nodes, but as the split with 0.6 is highly significant
> I would still have expected a split in this case too.
> 
>  
> I would appreciate any advice as to whether this is a known feature of rpart,
> whether I need to change the way my data are stored, or set some of the
> control options. I have tested a few of these options with no success.

Trying cp < 0 will have an effect: with the default cp = 0.01, a split that
does not reduce the misclassification rate is not kept.
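
A minimal sketch of that suggestion, assuming simulated data along the lines
of the original post but with 0.6 for group 2. The seed and the value
cp = -1 are arbitrary choices for illustration, not recommended settings.

```r
library(rpart)  # recommended package shipped with standard R distributions

set.seed(1)     # arbitrary seed, only for reproducibility
n <- 1000
dat <- data.frame(
  mygroup   = factor(rep(1:2, each = n)),
  mysuccess = factor(c(runif(n) <= 0.9, runif(n) <= 0.6),
                     levels = c(FALSE, TRUE),
                     labels = c("failure", "success"))
)

# Default control (cp = 0.01): a split that leaves the misclassification
# count unchanged is not kept.
fit_default <- rpart(mysuccess ~ mygroup, data = dat)

# Negative cp (value chosen arbitrarily): splits are no longer required
# to reduce the misclassification count.
fit_neg_cp <- rpart(mysuccess ~ mygroup, data = dat,
                    control = rpart.control(cp = -1))

print(fit_default)
print(fit_neg_cp)
```

Comparing the two printed trees shows whether the negative cp lets the split
on mygroup through; both leaves would still predict `success', which is why
the default settings drop it.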

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



