[R] Question about rpart decision trees (being used to predict customer churn)

Sun Aug 2 18:34:01 CEST 2009

Hello,

Isn't it counter-intuitive that the tree only finds the split when the
error is penalized less heavily?

See:

library(rpart)

experience <- as.factor(c(rep("good",90), rep("bad",10)))
cancel <- as.factor(c(rep("no",85), rep("yes",5),
                      rep("no",5), rep("yes",5)))

# Fit the tree with off-diagonal loss i and return the number of nodes
foo <- function( i ){
    tmp <- rpart(cancel ~ experience,
                 parms=list(loss=matrix(c(0,i,1,0), byrow=TRUE, nrow=2)))
    nrow( tmp$frame )
}

sapply( 1:20, foo )

The output I get is:

 [1] 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1

So the split disappears again once the penalty exceeds 16. Is that the
expected behaviour?
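
A rough way to check whether the node counts above are plausible is to compare the total loss of the unsplit root with the total loss of the two leaves for each penalty i. The sketch below uses base R only and does not call rpart. It assumes, as an inference from the output in this thread rather than from the rpart documentation, that a missed "yes" costs i and a false "yes" costs 1. The helper names node_loss and split_gain are made up for illustration, and the calculation ignores rpart's actual splitting criterion and cp threshold.

```r
# Counts taken from the table(experience, cancel) output quoted in this thread.
counts <- matrix(c(5, 85, 5, 5), nrow = 2,
                 dimnames = list(experience = c("bad", "good"),
                                 cancel = c("no", "yes")))

# Assumption (inferred from the thread's output, not from the rpart docs):
# predicting "no" for a true "yes" costs i, predicting "yes" for a true "no" costs 1.
node_loss <- function(n_no, n_yes, i) {
  min(n_yes * i,  # total loss if the node is labelled "no"
      n_no * 1)   # total loss if the node is labelled "yes"
}

# Loss of the root minus the summed loss of the two leaves after the split.
split_gain <- function(i) {
  root   <- node_loss(sum(counts[, "no"]), sum(counts[, "yes"]), i)
  leaves <- node_loss(counts["bad", "no"],  counts["bad", "yes"],  i) +
            node_loss(counts["good", "no"], counts["good", "yes"], i)
  root - leaves
}

gain <- sapply(1:20, split_gain)
which(gain > 0)            # penalties for which the split lowers the loss: 2 to 16
ifelse(gain > 0, 3, 1)     # matches the node counts reported above
```

Under that assumption, once i reaches 17 it is at least as cheap to label every customer "yes", so both leaves would carry the same label as the root and the split no longer reduces the loss. A loss matrix built from c(0,1,10000,0) filled column-wise has the same layout with i = 10000, which would put it in the same regime.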

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com

On Sun, 2009-08-02 at 08:41 +1000, Graham Williams wrote:
> 2009/7/27 Robert Smith <robertpsmith2008 at gmail.com>
> 
> > Hi,
> >
> > I am using rpart decision trees to analyze customer churn. I am finding
> > that the decision trees created are not effective because they are not
> > able to recognize factors that influence churn. I have created an example
> > situation below. What do I need to do for rpart to build a tree with the
> > variable experience? My guess is that this would happen if rpart used the
> > loss matrix while creating the tree.
> >
> > > experience <- as.factor(c(rep("good",90), rep("bad",10)))
> > > cancel <- as.factor(c(rep("no",85), rep("yes",5), rep("no",5), rep("yes",5)))
> > > table(experience, cancel)
> >          cancel
> > experience no yes
> >      bad   5   5
> >      good 85   5
> > > rpart(cancel ~ experience)
> > n= 100
> > node), split, n, loss, yval, (yprob)
> >      * denotes terminal node
> > 1) root 100 10 no (0.9000000 0.1000000) *
> >
> > I tried the following commands with no success.
> > rpart(cancel ~ experience, control=rpart.control(cp=.0001))
> > rpart(cancel ~ experience, parms=list(split='information'))
> > rpart(cancel ~ experience, parms=list(split='information'),
> >       control=rpart.control(cp=.0001))
> > rpart(cancel ~ experience,
> >       parms=list(loss=matrix(c(0,1,10000,0), nrow=2, ncol=2)))
> >
> > Thanks a lot for your help.
> >
> > Best regards,
> > Robert
> >
> 
> Hi Robert,
> 
> Perhaps try a less extreme loss matrix:
> 
> rpart(cancel ~ experience, parms=list(loss=matrix(c(0,5,1,0), byrow=TRUE,
> nrow=2)))
> 
> Output from Rattle:
> 
> Summary of the Tree model for Classification (built using rpart):
> 
> n= 100
> 
> node), split, n, loss, yval, (yprob)
>       * denotes terminal node
> 
> 1) root 100 50 no (0.90000000 0.10000000)
>   2) experience=good 90 25 no (0.94444444 0.05555556) *
>   3) experience=bad 10  5 yes (0.50000000 0.50000000) *
> 
> Classification tree:
> rpart(formula = cancel ~ ., data = crs$dataset, method = "class",
>     parms = list(loss = matrix(c(0, 5, 1, 0), byrow = TRUE, nrow = 2)),
>     control = rpart.control(cp = 0.0001, usesurrogate = 0,
>         maxsurrogate = 0))
> 
> Variables actually used in tree construction:
> [1] experience
> 
> Root node error: 50/100 = 0.5
> 
> n= 100
> 
>       CP nsplit rel error xerror xstd
> 1 0.4000      0       1.0    1.0 0.30
> 2 0.0001      1       0.6    0.6 0.22
> 
> TRAINING DATA Error Matrix - Counts
> 
>          Actual
> Predicted no yes
>       no  85   5
>       yes  5   5
> 
> 
> TRAINING DATA Error Matrix - Percentages
> 
>          Actual
> Predicted no yes
>       no  85   5
>       yes  5   5
> 
> Time taken: 0.01 secs
> 
> Generated by Rattle 2009-08-02 08:24:50 gjw
> ======================================================================
> 
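
The training error matrix in the Rattle output can be checked by hand. The sketch below uses base R only and assumes the fitted tree is exactly the one printed in the quoted output (predict "yes" when experience is "bad", otherwise "no"); it does not refit anything with rpart.

```r
# Data as defined in the thread.
experience <- factor(c(rep("good", 90), rep("bad", 10)))
cancel <- factor(c(rep("no", 85), rep("yes", 5), rep("no", 5), rep("yes", 5)))

# Predictions implied by the printed tree: node 3 (experience=bad) says "yes",
# node 2 (experience=good) says "no".
predicted <- factor(ifelse(experience == "bad", "yes", "no"),
                    levels = levels(cancel))

# Counts, laid out like the Rattle error matrix (rows = predicted, cols = actual).
err <- table(Predicted = predicted, Actual = cancel)
err

# Percentages of all cases; with n = 100 they coincide with the counts.
round(100 * err / sum(err))
```

Because there are exactly 100 cases, the counts table and the percentages table come out identical, which is why the two matrices in the Rattle output show the same numbers.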