[R] method of rpart when response variable is binary?

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Jun 15 16:21:51 CEST 2007


On Fri, 15 Jun 2007, ronggui wrote:

> Dear all,
>
> I would like to model the relationship between y and x. y is a binary
> variable, and x is a count variable which may be Poisson-distributed.
>
> I think it is better to divide x into intervals and change it to a
> factor before calling glm(y~x,data=dat,family=binomial).
>
> I tried rpart. As y is binary, I used the "class" method and got the
> following result.
>> rpart(y~x,data=dat,method="class")
> n=778 (22 observations deleted due to missingness)
>
> node), split, n, loss, yval, (yprob)
>      * denotes terminal node
>
> 1) root 778 67 0 (0.91388175 0.08611825) *
>
>
> With the default method, I get this result.
>
>> rpart(y~x,data=dat)
> n=778 (22 observations deleted due to missingness)
>
> node), split, n, deviance, yval
>      * denotes terminal node
>
> 1) root 778 61.230080 0.08611825
>  2) x< 19.5 750 53.514670 0.07733333
>    4) x< 1.25 390 17.169230 0.04615385 *
>    5) x>=1.25 360 35.555560 0.11111110 *
>  3) x>=19.5 28  6.107143 0.32142860 *
>
> If I use 1.25 and 19.5 as the cut points, I can change x into a factor by
>> x2 <- cut(q34b,breaks=c(0,1.25,19.5,200),right=FALSE)
>
> The coefficients in y~x2 are significant and make sense.
>
> My problem is: is it OK to use the default method in rpart when the
> response variable is binary?  Thanks.

Not unless you want a least-squares fit.  Note that you have only 8.6% of 
one class, and for such an unbalanced classification problem you are 
unlikely to do better than declaring the majority class (y = 0 here) for 
all examples, which is what the single-node "class" tree above does.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
