[R] Question about rpart decision trees (being used to predict customer churn)
Carlos J. Gil Bellosta
cgb at datanalytics.com
Sun Aug 2 18:34:01 CEST 2009
Hello,
Isn't it counter-intuitive that the tree finds the split only when the
error is penalized less?
See:
library(rpart)

experience <- as.factor(c(rep("good", 90), rep("bad", 10)))
cancel <- as.factor(c(rep("no", 85), rep("yes", 5),
                      rep("no", 5), rep("yes", 5)))

# Fit a tree with off-diagonal loss i and return its number of nodes
foo <- function(i) {
  tmp <- rpart(cancel ~ experience,
               parms = list(loss = matrix(c(0, i, 1, 0),
                                          byrow = TRUE, nrow = 2)))
  nrow(tmp$frame)
}

sapply(1:20, foo)
The output I get is:
[1] 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1
So something unexpected happens once the penalty exceeds 16: the tree
collapses back to a single node. Is that the intended behaviour?
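
A possible reading, worked out by hand and not checked against the rpart
source: suppose i acts as the cost of predicting "no" for a customer who
cancels, and predicting "yes" for a customer who stays costs 1. The root
loss of 50 (10 cancellers times 5) in the tree quoted below suggests this
for i = 5. The sketch below uses base R only; node_cost and gain are
hypothetical helper names, and it assumes a split is kept only when it
lowers the total misclassification cost.

```r
# Hand computation of misclassification cost (base R only, no rpart call).
# Assumption: i = cost of predicting "no" for a true "yes";
#             1 = cost of predicting "yes" for a true "no".
node_cost <- function(n_no, n_yes, i) {
  # Cost of labelling the whole node "no" vs. "yes"; keep the cheaper label
  min(n_yes * i, n_no * 1)
}

gain <- function(i) {
  root  <- node_cost(90, 10, i)                      # all 100 customers
  split <- node_cost(85, 5, i) + node_cost(5, 5, i)  # good + bad experience
  root - split                                       # cost removed by the split
}

gains <- sapply(1:20, gain)
which(gains > 0)  # values of i for which the split lowers the cost
```

Under these assumptions the gain is positive exactly for i = 2, ..., 16,
which matches the node counts above. From i = 17 on, labelling every node
"yes" is already the cheapest choice both before and after the split, so
the split removes no cost.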
Best regards,
Carlos J. Gil Bellosta
http://www.datanalytics.com
On Sun, 2009-08-02 at 08:41 +1000, Graham Williams wrote:
> 2009/7/27 Robert Smith <robertpsmith2008 at gmail.com>
>
> > Hi,
> >
> > I am using rpart decision trees to analyze customer churn. I am finding
> > that the decision trees created are not effective because they are not
> > able to recognize factors that influence churn. I have created an example
> > situation below. What do I need to do for rpart to build a tree with the
> > variable experience? My guess is that this would happen if rpart used the
> > loss matrix while creating the tree.
> >
> > > experience <- as.factor(c(rep("good",90), rep("bad",10)))
> > > cancel <- as.factor(c(rep("no",85), rep("yes",5), rep("no",5),
> > rep("yes",5)))
> > > table(experience, cancel)
> >           cancel
> > experience no yes
> >       bad   5   5
> >       good 85   5
> > > rpart(cancel ~ experience)
> > n= 100
> > node), split, n, loss, yval, (yprob)
> > * denotes terminal node
> > 1) root 100 10 no (0.9000000 0.1000000) *
> >
> > I tried the following commands with no success.
> > rpart(cancel ~ experience, control=rpart.control(cp=.0001))
> > rpart(cancel ~ experience, parms=list(split='information'))
> > rpart(cancel ~ experience, parms=list(split='information'),
> > control=rpart.control(cp=.0001))
> > rpart(cancel ~ experience, parms=list(loss=matrix(c(0,1,10000,0), nrow=2,
> > ncol=2)))
> >
> > Thanks a lot for your help.
> >
> > Best regards,
> > Robert
> >
>
> Hi Robert,
>
> Perhaps try a less extreme loss matrix:
>
> rpart(cancel ~ experience, parms=list(loss=matrix(c(0,5,1,0), byrow=TRUE,
> nrow=2)))
>
> Output from Rattle:
>
> Summary of the Tree model for Classification (built using rpart):
>
> n= 100
>
> node), split, n, loss, yval, (yprob)
> * denotes terminal node
>
> 1) root 100 50 no (0.90000000 0.10000000)
>   2) experience=good 90 25 no (0.94444444 0.05555556) *
>   3) experience=bad 10 5 yes (0.50000000 0.50000000) *
>
> Classification tree:
> rpart(formula = cancel ~ ., data = crs$dataset, method = "class",
> parms = list(loss = matrix(c(0, 5, 1, 0), byrow = TRUE, nrow = 2)),
> control = rpart.control(cp = 0.0001, usesurrogate = 0, maxsurrogate =
> 0))
>
> Variables actually used in tree construction:
> [1] experience
>
> Root node error: 50/100 = 0.5
>
> n= 100
>
>       CP nsplit rel error xerror xstd
> 1 0.4000      0       1.0    1.0 0.30
> 2 0.0001      1       0.6    0.6 0.22
>
> TRAINING DATA Error Matrix - Counts
>
>          Actual
> Predicted no yes
>       no  85   5
>       yes  5   5
>
>
> TRAINING DATA Error Matrix - Percentages
>
>          Actual
> Predicted no yes
>       no  85   5
>       yes  5   5
>
> Time taken: 0.01 secs
>
> Generated by Rattle 2009-08-02 08:24:50 gjw
> ======================================================================
>