[R] Odd results from rpart classification tree
Marshall, Jonathan
J.C.marshall at massey.ac.nz
Mon May 15 04:25:33 CEST 2017
The following code produces a tree with only a root node. However, a tree with a split at x=0.5 is clearly better in terms of Gini impurity, and rpart doesn't seem to want to produce it.
library(rpart)
y <- c(rep(0, 65), rep(1, 15), rep(0, 20))
x <- c(rep(0, 70), rep(1, 30))
f <- rpart(y ~ x, method = 'class', minsplit = 1, cp = 0.0001, parms = list(split = 'gini'))
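To confirm the root-only result, the fitted object can be inspected directly. This is a minimal sketch that refits the same model under the name `fit` (an illustrative name, not from the code above) and assumes the rpart package is installed.

```r
library(rpart)

y <- c(rep(0, 65), rep(1, 15), rep(0, 20))
x <- c(rep(0, 70), rep(1, 30))

# Same call as above; minsplit and cp are passed through to rpart.control()
fit <- rpart(y ~ x, method = "class", minsplit = 1, cp = 0.0001,
             parms = list(split = "gini"))

print(fit)              # text listing of the nodes in the fitted tree
print(nrow(fit$frame))  # fit$frame has one row per node
print(fit$cptable)      # complexity parameter table for the fit
```
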
Computing the improvement for a split at x=0.5 manually:
n <- length(y)
obs_L <- y[x < .5]
obs_R <- y[x > .5]
n_L <- sum(x < .5)
n_R <- sum(x > .5)
gini <- function(p) {sum(p*(1-p))}
impurity_root <- gini(prop.table(table(y)))
impurity_L <- gini(prop.table(table(obs_L)))
impurity_R <- gini(prop.table(table(obs_R)))
# weighted decrease in Gini impurity from the split
improvement <- impurity_root * n - (n_L*impurity_L + n_R*impurity_R) # 2.880952
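For anyone who wants to reproduce the number in a fresh session, the same calculation can be wrapped up as a self-contained sketch. Here `gini_improvement` is a hypothetical helper written for this post, not a function from rpart, and the split is assumed to be the binary one at x = 0.5.

```r
y <- c(rep(0, 65), rep(1, 15), rep(0, 20))
x <- c(rep(0, 70), rep(1, 30))

# Gini impurity of a vector of class labels
gini <- function(labels) {
  p <- prop.table(table(labels))
  sum(p * (1 - p))
}

# Hypothetical helper: weighted decrease in Gini impurity when the
# observations are divided into left (TRUE) and right (FALSE) children
gini_improvement <- function(y, left) {
  n <- length(y)
  n * gini(y) - (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left]))
}

improvement <- gini_improvement(y, x < 0.5)
print(improvement)  # 2.880952

# Class counts on each side of the split:
# x = 0 has 65 zeros and 5 ones; x = 1 has 20 zeros and 10 ones
print(table(x = x, y = y))
```
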
Thus, the split gives a positive improvement of 2.88, which I would expect to result in a split. It does not.
Why?
Jonathan