[R] method of rpart when response variable is binary?
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Jun 15 16:21:51 CEST 2007
On Fri, 15 Jun 2007, ronggui wrote:
> Dear all,
>
> I would like to model the relationship between y and x. y is a binary
> variable, and x is a count variable that may be Poisson-distributed.
>
> I think it is better to divide x into intervals and convert it to a
> factor before calling glm(y~x,data=dat,family=binomial).
>
> I tried rpart. As y is binary, I used the "class" method and got the
> following result.
>> rpart(y~x,data=dat,method="class")
> n=778 (22 observations deleted due to missingness)
>
> node), split, n, loss, yval, (yprob)
> * denotes terminal node
>
> 1) root 778 67 0 (0.91388175 0.08611825) *
>
>
> With the default method, I get the following result.
>
>> rpart(y~x,data=dat)
> n=778 (22 observations deleted due to missingness)
>
> node), split, n, deviance, yval
> * denotes terminal node
>
> 1) root 778 61.230080 0.08611825
>   2) x< 19.5 750 53.514670 0.07733333
>     4) x< 1.25 390 17.169230 0.04615385 *
>     5) x>=1.25 360 35.555560 0.11111110 *
>   3) x>=19.5 28 6.107143 0.32142860 *
>
> If I use 1.25 and 19.5 as the cut points and convert x into a factor with
>> x2 <- cut(q34b,breaks=c(0,1.25,19.5,200),right=FALSE)
>
> the coefficients in y~x2 are significant and make sense.
>
> My question is: is it OK to use the default method in rpart when the
> response variable is binary? Thanks.
Not unless you want a least-squares fit. Note that only 8.6% of your cases
fall in one class, and for such an unbalanced classification problem you are
unlikely to do better than declaring the majority class (here y = 0) for all
examples.
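
The figures in the quoted output illustrate both points. The sketch below
first checks, in base R, that the "deviance" printed by the default (anova)
method is just the within-node sum of squares n * p * (1 - p) for a 0/1
response, and that every leaf still has a majority of zeros, so a
majority-vote classifier gains nothing from the splits. The second part is
an illustration only: the simulated data, the seed, and the 5:1 loss ratio
are assumptions that do not come from the original post, and the resulting
tree will not match the one shown above.

```r
# Node sizes and fitted means copied from the anova tree in the question.
n     <- c(root = 778, left = 390, mid = 360, right = 28)
p_hat <- c(root = 0.08611825, left = 0.04615385,
           mid  = 0.11111110, right = 0.32142860)

# For a 0/1 response, sum((y - mean(y))^2) within a node is n * p * (1 - p).
# These values reproduce the "deviance" column of the anova tree.
ss <- n * p_hat * (1 - p_hat)
print(round(ss, 3))

# Every node has an estimated proportion of ones below 0.5, so a majority
# vote predicts 0 everywhere, whether or not the tree is split.
print(all(p_hat < 0.5))

# Number of ones per node; the three leaves add up to the root count,
# so the splits do not change the number of misclassified cases.
events <- round(n * p_hat)
print(events)
print(sum(events[c("left", "mid", "right")]) == events[["root"]])

# Hypothetical illustration (simulated data, NOT the poster's data):
# if missing a rare "1" is considered costlier than a false alarm,
# a loss matrix can be passed to rpart through 'parms'.
library(rpart)
set.seed(1)                                     # arbitrary seed
sim <- data.frame(x = rpois(800, lambda = 3))   # assumed count predictor
sim$y <- factor(rbinom(800, 1, plogis(-3 + 0.3 * sim$x)))

# Rows are the true class, columns the predicted class (levels "0", "1").
# Here a true 1 classified as 0 costs 5; a true 0 classified as 1 costs 1.
loss <- matrix(c(0, 1,
                 5, 0), nrow = 2, byrow = TRUE)

fit_default <- rpart(y ~ x, data = sim, method = "class")
fit_loss    <- rpart(y ~ x, data = sim, method = "class",
                     parms = list(loss = loss))
print(fit_default)
print(fit_loss)
```

Whether unequal losses (or altered priors via parms = list(prior = ...)) are
appropriate depends on the actual costs of the two kinds of error; the 5:1
ratio above is purely a placeholder.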
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595