[R] User-defined functions in rpart

Tue Jun 17 13:35:12 CEST 2003

> This question concerns rpart's facility for user-defined functions that
> accomplish splitting.
>
> I was interested in modifying the code so that in each terminal node,
> a linear regression is fit to the data.
>

if you just want to have a linear model in each terminal node instead of
the mean of the observations in this leaf (which does not require an
altered splitting rule), you can do something like (BostonHousing as example):

R> library(mlbench)
R> library(rpart)
### a stump only
R> tree <- rpart(medv ~ ., data=BostonHousing, cp=0.2)
R> tree
n= 506

node), split, n, deviance, yval
      * denotes terminal node

1) root 506 42716.300 22.53281
  2) rm< 6.941 430 17317.320 19.93372 *
  3) rm>=6.941 76  6059.419 37.23816 *
### fit a linear model for the observations in each leaf
R> tnodeleft <- lm(medv ~ ., data=BostonHousing, subset=(tree$where == 2))
R> tnoderight <- lm(medv ~ ., data=BostonHousing, subset=(tree$where == 3))
R> coef(tnodeleft)
  (Intercept)          crim            zn         indus         chas1
 5.291185e+01 -1.430523e-01  3.879871e-02  2.818030e-02  3.329020e+00
          nox            rm           age           dis           rad
-1.686631e+01 -3.671042e-01 -3.791325e-04 -1.128992e+00  2.954484e-01
          tax       ptratio             b         lstat
-1.071171e-02 -5.472690e-01  6.393379e-03 -5.578978e-01
R> coef(tnoderight)
  (Intercept)          crim            zn         indus         chas1
  8.224189122  -0.078029764   0.001545735   0.616441038  -0.928230180
          nox            rm           age           dis           rad
-17.080614140   6.372744363  -0.060863991  -1.032928738   0.451670727
          tax       ptratio             b         lstat
 -0.034438740  -1.647800912   0.093841671  -1.172865798

> It seems that from the allowable inputs in the user-defined functions,
> that this may not be possible, since they have the form:
>
> function(y, wt, parms)  (in the case of the "evaluation" function)
> function(y, wt, x, parms, continuous)  (split function)
>
> The problem is that there seems to be no facility to include an X
> matrix (in the split function, x is a vector corresponding to one
> predictor).  Without that, fitting a linear model in the terminal node
> would not be possible.
>
> Is this a correct assesment, or am I missing something?

If you want to base your splitting criterion on linear models in EACH
node, you may try something like this.
I'm not sure if you need to pass the design matrix to the split function:
let Y and X denote response and design matrix in the calling environment,
than an ugly hack like:

temp2 <- function(y, wt, x, parms, continuous) {
  thisindx <- Y %in% y  ### determine which subset of the learning
                        ### sample is currently under consideration

  thisX <- X[thisindx,] ### only those in the current node

  ### get the position of the x to split in:
  myx <- which(apply(thisX, 2, function(a) all(a %in% x))])

  ### and now seach for the best split in thisX[,thisx]
  ### for each cutpoint cut ...
  lm(y ~ ., data = thisX, subset=thisX[,myx] <= cut)
  lm(y ~ ., data = thisX, subset=thisX[,myx] > cut)

... may work

just an idea,

Torsten

> Has anyone tried to modify rpart to fit linear models in nodes?
>
>
> --
> --
> -- Hugh Chipman                                                      --
> -- Associate Professor, Department of Statistics & Actuarial Science --
> -- U. of Waterloo, 200 University Ave. W, Waterloo, Ontario, N2L 3G1 --
> -- (519) 888-4567 ext. 6190                      Fax: (519) 746-1875 --
> -- www.stats.uwaterloo.ca/~hachipma/          hachipman at uwaterloo.ca --
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
>
>