[R] questions on rpart (tree changes when rearrange the order of covariates?!)

Dimitri Liakhovitski ld7631 at gmail.com
Wed May 13 15:29:41 CEST 2009


I wonder - isn't this issue one of the reasons to use RandomForests
rather than CART?
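
For anyone who wants to see the tie-breaking effect discussed in the quoted
messages in isolation, here is a minimal sketch. It assumes a made-up data
set (not the Pima data) in which x2 is an exact copy of x1, so the two
predictors tie at every split; the names d_a, d_b, fit_a and fit_b are only
illustrative. If rpart resolves ties in column order, the root split should
follow whichever copy is listed first.

```r
library(rpart)

# Toy data (an assumption for illustration): x2 duplicates x1 exactly,
# so any split on x1 is tied with the same split on x2.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
y  <- factor(ifelse(x1 + rnorm(n, sd = 0.5) > 0, "pos", "neg"))

d_a <- data.frame(y = y, x1 = x1, x2 = x1)  # x1 listed first
d_b <- d_a[, c("y", "x2", "x1")]            # same data, x2 listed first

fit_a <- rpart(y ~ ., data = d_a, method = "class")
fit_b <- rpart(y ~ ., data = d_b, method = "class")

# Variable used at the root split of each tree (first row of $frame)
root_a <- as.character(fit_a$frame$var[1])
root_b <- as.character(fit_b$frame$var[1])
print(c(original = root_a, reordered = root_b))
```

Comparing root_a with root_b shows whether column order alone changed the
chosen variable, without any change to the data.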

On Wed, May 13, 2009 at 8:03 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
> From: Uwe Ligges
>>
>> Yuanyuan wrote:
>> > Greetings,
>> >
>> > I am using rpart for classification with the "class" method. The test
>> > data is the Pima Indians diabetes data from package mlbench.
>> >
>> > I first fitted a classification tree using the original data, and then
>> > exchanged the order of body mass and plasma glucose, which are the
>> > strongest variables in the growing phase. The second tree is a little
>> > different from the first one. The misclassification tables are
>> > different too. I did not change the data, so why are the results so
>> > different?
>>
>> Well, at some splits two variables may yield the same reduction of the
>> impurity criterion, in which case the one that comes first is used;
>> hence a different result.
>>
>> Uwe Ligges
>
> I recently tried writing adaboost.m1 using rpart, and was surprised that
> with a very small training set (say n=10 or 20), I get a large improvement
> in test set accuracy if I randomly shuffle the columns in the data at
> every adaboost iteration.  (With twonorm data, we're talking about 25%
> error vs. 19%, using a test set of n=2000.)  It turned out to be the way
> rpart deals with ties: first come, first win.  Without shuffling the
> columns, rpart almost never picks any variable beyond the 10th.  (In
> twonorm, all variables are equally important, so one would expect
> roughly equal selection frequency.)
>
> I've gotten some pointers from Terry Therneau about where in the code to
> check.  I may try to implement breaking ties at random (as I've done in
> randomForest).  No promises, though...
>
> Andy
>
>>
>>
>>
>> >
>> > Does anyone know how rpart deals with ties?
>> >
>> > Here is the code for running the two trees.
>> >
>> >
>> > library(mlbench)
>> > data(PimaIndiansDiabetes2)
>> > mydata <- PimaIndiansDiabetes2
>> > library(rpart)
>> > ## tree on the original column order
>> > fit2 <- rpart(diabetes ~ ., data = mydata, method = "class")
>> > plot(fit2, uniform = TRUE, main = "CART for original data")
>> > text(fit2, use.n = TRUE, cex = 0.6)
>> > printcp(fit2)
>> > table(predict(fit2, type = "class"), mydata$diabetes)
>> > ## misclassification table: rows are fitted class
>> > ##       neg pos
>> > ##   neg 437  68
>> > ##   pos  63 200
>> > #Klimt(fit2,mydata)
>> >
>> > ## same data with columns 2 (glucose) and 6 (mass) swapped
>> > pmydata <- mydata[, c(1, 6, 3, 4, 5, 2, 7, 8, 9)]
>> > fit3 <- rpart(diabetes ~ ., data = pmydata, method = "class")
>> > plot(fit3, uniform = TRUE, main = "CART after exchanging mass & glucose")
>> > text(fit3, use.n = TRUE, cex = 0.6)
>> > printcp(fit3)
>> > table(predict(fit3, type = "class"), pmydata$diabetes)
>> > ## after exchanging the order of body mass and plasma glucose
>> > ##       neg pos
>> > ##   neg 436  64
>> > ##   pos  64 204
>> > #Klimt(fit3,pmydata)
>> >
>> >
>> > Thanks,
>> >
>> >
>> >
>> > Yuanyuan Huang
>> >
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>
>
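
Along the lines of the column shuffling Andy describes above, here is a
rough sketch of a wrapper that permutes the predictor columns before each
fit. rpart_shuffled is a hypothetical helper, not something rpart provides,
and the toy data (x2 duplicating x1) are invented purely to force ties. It
only randomises which tied variable wins by reordering the input; it is not
the in-code tie-breaking Andy mentions.

```r
library(rpart)

# Hypothetical helper (not part of rpart): permute the predictor columns
# before fitting, so ties are not always resolved in favour of the same
# leading column.
rpart_shuffled <- function(data, response, method = "class") {
  predictors <- setdiff(names(data), response)
  shuffled   <- data[, c(response, sample(predictors)), drop = FALSE]
  f <- as.formula(paste(response, "~ ."))
  rpart(f, data = shuffled, method = method)
}

# Toy data (an assumption for illustration): x2 is an exact copy of x1,
# so the two predictors tie at every candidate split.
set.seed(42)
x1  <- rnorm(200)
toy <- data.frame(
  y  = factor(ifelse(x1 + rnorm(200, sd = 0.5) > 0, "pos", "neg")),
  x1 = x1,
  x2 = x1
)

# Root-split variable over 50 shuffled fits
roots <- replicate(50, as.character(rpart_shuffled(toy, "y")$frame$var[1]))
print(table(roots))
```

In a boosting loop one would call such a wrapper once per iteration, so
that each tree sees the columns in a fresh random order.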



-- 
Dimitri Liakhovitski
MarketTools, Inc.
Dimitri.Liakhovitski at markettools.com



