\documentclass{article}
\usepackage[pdftex]{graphicx}
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
\SweaveOpts{keep.source=TRUE, fig=FALSE}
%\VignetteIndexEntry{User Written Split Functions}
%\VignetteDepends{rpart}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\newcommand{\myfig}[1]{\resizebox{\textwidth}{!}
                       {\includegraphics{#1.pdf}}}
\title{User written splitting functions for RPART}
\author{Terry Therneau \\ Mayo Clinic}

\begin{document}
\maketitle

<<echo=FALSE>>=
options(continue=" ", width=60)
options(SweaveHooks=list(fig=function() par(mar=c(5.1, 4.1, .3, 1.1))))
@

\section{Splitting functions}
The rpart code was written in a modular fashion with the idea that the
C code would be extended to include more splitting functions.
As a consequence all of the splitting functions are coordinated through a
single structure in the file \texttt{func\_table.h}, shown below:
\begin{verbatim}
static struct {
    int    (*init_split)();
    void   (*choose_split)();
    void   (*eval)();
    double (*error)();
} func_table [] = {
    {anovainit,      anova,     anovass,        anovapred},
    {poissoninit,    poisson,   poissondev,     poissonpred},
    {giniinit,       gini,      ginidev,        ginipred},
    {usersplit_init, usersplit, usersplit_eval, usersplit_pred}
};

#define NUM_METHODS 4   /* size of the above structure */
\end{verbatim}
Adding a new splitting method requires the definition of four functions:
one each to initialize, choose a split, compute the predicted value or
values for a node, and compute a prediction error for a new observation
assigned to the node.
The last of these is used only in cross-validation.

To add new splitting rules to the main routine, four new C functions need
to be defined, the \texttt{func\_table.h} file expanded, the R function
\texttt{rpart} updated to recognize the new option (it calls the C routine
with the row number of func\_table), and finally an initialization function
written.
See the \texttt{rpart.class}, \texttt{rpart.anova}, and \texttt{rpart.exp}
functions for examples of the latter.
The lion's share of the code, which deals with all the necessary
bookkeeping, remains unchanged.

An easier route is to add a user-defined splitting rule as a set of R
functions.
The resulting code will be considerably slower than the built-in functions;
nevertheless this provides an easy way to extend \texttt{rpart}, and to test
out new ideas with less effort than additions to the C base.
For user-defined splits only the first three functions are defined, and
cross-validation does not happen automatically.
(User-defined methods are many times slower than the built-in ones, so it
may be preferable to skip cross-validation during the initial evaluation of
a model.)

\section{Anova function}
As the first illustration we show how to add anova splitting as a
user-written addition.
This method is also one of those built into the C code, so we can check the
correctness of the results.
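Before defining the three functions it may help to see how they will
eventually be supplied to \texttt{rpart}.
The sketch below is not part of the original code and uses a hypothetical
data set \texttt{mydata}; the complete, working call for the state data
appears in the Test subsection further on.
<<eval=FALSE>>=
# Sketch only: itemp, stemp, and etemp are defined in the subsections
#  that follow, and mydata stands in for the user's own data set
ulist <- list(init = itemp, split = stemp, eval = etemp)
fit <- rpart(y ~ x1 + x2, data = mydata, method = ulist)
@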
\subsection{Initialization function}
<<>>=
itemp <- function(y, offset, parms, wt) {
    if (is.matrix(y) && ncol(y) > 1)
        stop("Matrix response not allowed")
    if (!missing(parms) && length(parms) > 0)
        warning("parameter argument ignored")
    if (length(offset)) y <- y - offset
    sfun <- function(yval, dev, wt, ylevel, digits) {
        paste(" mean=", format(signif(yval, digits)),
              ", MSE=", format(signif(dev/wt, digits)),
              sep = '')
    }
    environment(sfun) <- .GlobalEnv
    list(y = c(y), parms = NULL, numresp = 1, numy = 1, summary = sfun)
}
@
On input the function will be called with
\begin{description}
  \item[y] the response value as found in the formula.
    Note that rpart will normally have removed any observations with a
    missing response.
  \item[offset] the offset term, if any, found on the right hand side of
    the formula
  \item[parms] the vector or list (if any) supplied by the user as a
    \texttt{parms} argument to the call.
  \item[wt] the weight vector from the call, if any
\end{description}
The last two arguments are optional.

The initialization function needs to perform any data checks, deal with the
offset vector if present, and return a list containing
\begin{description}
  \item[y] the value of the response, possibly updated
  \item[numy] the number of columns of y
  \item[numresp] the length of the prediction vector for each node
  \item[parms] the vector of parameters, if any, to be passed forward to
    the splitting and evaluation functions
  \item[summary] optional: a function which will produce a 1--3 line
    summary for the node, to be used by \texttt{summary.rpart}.
  \item[print] optional: a function which will produce a one line summary
    for the \texttt{print} function.
  \item[text] optional: a function which will produce a short label for the
    node, used by the \texttt{plot} function.
\end{description}
The parameters vector can contain whatever might be appropriate for the
method, e.g.\ the variance of a prior distribution.
The vector of parameters passed forward need not be the same as the set
passed into the routine.
In anova splitting there are no extra parameters.

The summary function will be called with arguments which are
\begin{itemize}
  \item yval= the response value for the node
  \item dev = the deviance for the node
  \item wt = the sum of weights at the node (number of observations)
  \item ylevel = the levels of y, if y is categorical
  \item digits = the current setting for digits
\end{itemize}
It should return a character vector.
The text function will be called with these arguments plus two more
\begin{itemize}
  \item n = the number of observations in the node
  \item use.n = TRUE/FALSE, see the help page for \texttt{text.rpart}
\end{itemize}
The print function is called only with yval, ylevel, and digits.

The only puzzling line may be \texttt{environment(sfun) <- .GlobalEnv}.
R ensures that any function can access variables that are \emph{not} passed
in as arguments via a mechanism called environments.
As a trivial example
<<>>=
temp <- 4
fun1 <- function(x) {
    q <- 15
    z <- 10
    fun2 <- function(y) y + z + temp
    fun2(x^2)
}
fun1(5)
@
The definition of fun2 essentially contains a copy of \texttt{z} (and
\texttt{q} as well) to ensure that this works.
The exception to this is objects at the top level such as \texttt{temp}
above; the user is responsible for the retention of those via their answer
to the ``save'' question at the end of an R session.
The consequence of this is that the summary function created by a call to
itemp will by default have copies of all of the external variables that it
\emph{might} make use of, in this case all of the input arguments.
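As a concrete illustration (this check is not part of the original
vignette), consider a stripped-down copy of the function, here called
\texttt{itemp0}, that omits the environment assignment:
<<>>=
itemp0 <- function(y, offset, parms, wt) {
    sfun <- function(yval, dev, wt, ylevel, digits)
        paste(" mean=", format(signif(yval, digits)), sep = '')
    list(y = c(y), parms = NULL, numresp = 1, numy = 1, summary = sfun)
}
zz <- itemp0(1:10, numeric(0), NULL, rep(1, 10))
ls(environment(zz$summary))  # the returned function still carries y, offset, parms, wt
@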
The summary function is in fact self-contained and makes reference only to
its input arguments (doing otherwise is bad programming, in my opinion) and
so needs none of these.
Setting the environment of the function to \texttt{.GlobalEnv} in essence
tells R to treat the function as though it had been defined at top level,
i.e., the input prompt, and thus not attach these extraneous copies.
I first became aware of this issue when using a huge data set, and found
that the saved output of the fit took up an inordinate amount of disk space
due to data copies attached to the summary function.
Do not take this as a critique of R environments in general.
The rules governing them need to be responsive to a large ensemble of
conditions; in this particular case the results are not ideal, but in the
main they are the best compromise possible.

\subsection{Evaluation function}
The evaluation function is called once per node.
It needs to produce a deviance for the node along with a vector to be used
as a label for the node.
<<>>=
etemp <- function(y, wt, parms) {
    wmean <- sum(y*wt)/sum(wt)
    rss <- sum(wt*(y-wmean)^2)
    list(label = wmean, deviance = rss)
}
@
As an example of a longer label, the gini splitting method for categorical
outcomes returns both the chosen group and the full vector of estimated
group probabilities.
The deviance value that is returned does not need to be an actual deviance
associated with a log-likelihood; any impurity measure for the node will
work.
However, the pruning algorithm used during tree construction will be most
efficient if the value 0 corresponds to a perfectly pure node.
(The pruning code decides when a branch would be futile and further
computations on it can thus be avoided.)

\subsection{Splitting function}
The splitting function is where the work lies.
It will be called once for every covariate at each potential split.
The input arguments are
\begin{description}
  \item[y] vector or matrix of response values
  \item[wt] vector of weights
  \item[x] vector of x values
  \item[parms] vector of user parameters, passed forward
  \item[continuous] if TRUE the x variable should be treated as continuous
\end{description}
The data will have been subset to the observations at the node with
non-missing values of x and y, and if x is continuous the observations will
be in sorted order of the x variable.

If x is continuous the routine should return two vectors, each of length
$n-1$, where $n$ is the length of x:
\begin{description}
  \item[goodness] the utility of the split, where larger numbers are
    better.
    A value of 0 signifies that no worthwhile split could be found (for
    instance if y were a constant).
    The $i$th value of goodness compares a split of observations 1 to $i$
    versus $i+1$ to $n$.
  \item[direction] A vector of the same length with values of $-1$ and
    $+1$, where $-1$ suggests that observations with x $<$ cutpoint be sent
    to the left side of the tree, and a value of $+1$ that they be sent to
    the right.
    This is not critical, but sending larger values of y to the right, as
    is done in the code below, seems to make the final tree easier to read.
\end{description}
The reason for returning an entire vector of goodness values is that the
parent code is responsible for handling the minimum node size constraint,
and also for dealing with ties.
When x is continuous the split routine actually has no reason to look at x.
<<>>=
stemp <- function(y, wt, x, parms, continuous) {
    # Center y
    n <- length(y)
    y <- y - sum(y*wt)/sum(wt)

    if (continuous) {
        # continuous x variable
        temp <- cumsum(y*wt)[-n]
        left.wt  <- cumsum(wt)[-n]
        right.wt <- sum(wt) - left.wt
        lmean <- temp/left.wt
        rmean <- -temp/right.wt
        goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
        list(goodness = goodness, direction = sign(lmean))
    } else {
        # Categorical X variable
        ux <- sort(unique(x))
        wtsum <- tapply(wt, x, sum)
        ysum  <- tapply(y*wt, x, sum)
        means <- ysum/wtsum

        # For anova splits, we can order the categories by their means
        #  then use the same code as for a non-categorical
        ord <- order(means)
        n <- length(ord)
        temp <- cumsum(ysum[ord])[-n]
        left.wt  <- cumsum(wtsum[ord])[-n]
        right.wt <- sum(wt) - left.wt
        lmean <- temp/left.wt
        rmean <- -temp/right.wt
        list(goodness = (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2),
             direction = ux[ord])
    }
}
@
The code above does the computations for all the split points at once by
making use of two tricks.
The first is to center the y values at zero (so the grand mean is zero),
and the second takes advantage of the many ways to write the ``effect'' sum
of squares for a simple two-group anova.
The key identity is
\begin{equation}
  \sum (y_i - c)^2 = \sum (y_i - \overline y)^2 + n(c - \overline y)^2
\end{equation}
If you have an old enough statistics book, this is used to show that for a
1-way anova the between groups sum of squares $SS_B$ is
\begin{equation}
  SS_B = n_l(\overline y_l - \overline y)^2 +
         n_r(\overline y_r - \overline y)^2 \label{ssb}
\end{equation}
where $n_l$ is the number of observations in the left group,
$\overline y_l$ the mean of $y$ in the left group, $\overline y$ the
overall mean, and $n_r$, $\overline y_r$ the corresponding terms for the
right hand group.
Centering at zero makes $\overline y$ zero in \eqref{ssb}, and the terms
then can be computed for all splits at once using the cumsum function.
Extension of the formulas to case weights is left as an exercise for the
reader.

If the predictor variable x is categorical with $k$ classes then there are
potentially $2^{k-1} -1$ different ways to split the node.
However, for most splitting rules the optimal split can be found by first
ordering the groups by their average y value and then applying the usual
splitting rule to this ordered variable.
For user-written rules this is assumed to be the case.
The variable x is supplied as integer values 1, 2, \ldots, k.
On return the direction vector should have $k$ values giving the ordering
of the groups, and the goodness vector $k-1$ values giving the utility of
the splits.

\subsection{Test}
We can test the above code to make sure that it gives the same answer as
the built-in anova splitting method.
<<>>=
library(rpart)
mystate <- data.frame(state.x77, region = state.region)
names(mystate) <- casefold(names(mystate)) #remove mixed case
ulist <- list(eval = etemp, split = stemp, init = itemp)
fit1 <- rpart(murder ~ population + illiteracy + income + life.exp +
              hs.grad + frost + region, data = mystate,
              method = ulist, minsplit = 10)
fit2 <- rpart(murder ~ population + illiteracy + income + life.exp +
              hs.grad + frost + region, data = mystate,
              method = 'anova', minsplit = 10, xval = 0)
all.equal(fit1$frame, fit2$frame)
all.equal(fit1$splits, fit2$splits)
all.equal(fit1$csplit, fit2$csplit)
all.equal(fit1$where, fit2$where)
all.equal(fit1$cptable, fit2$cptable)
@
The all.equal test can't be done on fit1 vs fit2 as a whole since their
\texttt{call} components will differ.
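As one further check --- not part of the original vignette --- we can
verify numerically the earlier claim that for anova splitting the best
binary grouping of a categorical predictor lies among the $k-1$ groupings
obtained by ordering the categories on their mean response.
The brute-force comparison below uses the four regions of the state data.
<<>>=
regions <- levels(mystate$region)
ybar <- mean(mystate$murder)
ssb <- function(grp) {    # between-groups sum of squares for grp vs the rest
    left <- mystate$region %in% grp
    sum(left) * (mean(mystate$murder[left]) - ybar)^2 +
        sum(!left) * (mean(mystate$murder[!left]) - ybar)^2
}
# all binary groupings of the 4 regions (each split is generated twice,
#  once from either side, which does not affect the maximum)
allsplits <- unlist(lapply(1:3, function(m)
                    combn(regions, m, simplify = FALSE)), recursive = FALSE)
# the k-1 = 3 groupings suggested by ordering the regions on their mean
ord <- names(sort(tapply(mystate$murder, mystate$region, mean)))
ordsplits <- lapply(1:3, function(i) ord[1:i])
all.equal(max(sapply(allsplits, ssb)), max(sapply(ordsplits, ssb)))
@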
\section{Cross validation}
To do cross-validation on user-written rules one needs to use the
\texttt{xpred.rpart} routine.
This routine returns the predicted value(s) for each observation, obtained
from a fit that did not include that observation.
The result is a matrix with one row per subject and one column for each
complexity parameter value.
As an example we will replicate the cross-validation results for the anova
model above.
In order to get the same groupings we fix the xval group membership in
advance.
<<>>=
xgroup <- rep(1:10, length = nrow(mystate))
xfit <- xpred.rpart(fit1, xgroup)
xerror <- colMeans((xfit - mystate$murder)^2)

fit2b <- rpart(murder ~ population + illiteracy + income + life.exp +
               hs.grad + frost + region, data = mystate,
               method = 'anova', minsplit = 10, xval = xgroup)
topnode.error <- (fit2b$frame$dev/fit2b$frame$wt)[1]

xerror.relative <- xerror/topnode.error
all.equal(xerror.relative, fit2b$cptable[, 4], check.attributes = FALSE)
@

\begin{figure}
  \myfig{usercode-fig1}
  \caption{Goodness of split for predicting income from illiteracy, using
    the state data.}
  \label{fig:state}
\end{figure}

<<fig1, fig=TRUE, include=FALSE>>=
tdata <- mystate[order(mystate$illiteracy), ]
n <- nrow(tdata)
temp <- stemp(tdata$income, wt = rep(1, n), tdata$illiteracy,
              parms = NULL, continuous = TRUE)
xx <- (tdata$illiteracy[-1] + tdata$illiteracy[-n])/2
plot(xx, temp$goodness, xlab = "Illiteracy cutpoint",
     ylab = "Goodness of split")
lines(smooth.spline(xx, temp$goodness, df = 4), lwd = 2, lty = 2)
@

\subsection{Smoothed anova}
For any particular covariate, consider a plot with the split point on the x
axis and the goodness of split on the y axis; figure \ref{fig:state} shows
an example for the state data set, along with a smoothed curve.
The plot points are very erratic.
At one time we entertained the idea that a more stable split point would be
found by finding the maximum value of a smoothed version of the plot.
Exactly how to smooth is of course a major issue in such an endeavor.
Modification of the anova splitting routine to test this is quite easy ---
simply add a few lines to the stemp routine:
\begin{verbatim}
        ...
        goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
        rx <- rank(x[-1])  # use only the ranks of x, to preserve invariance
        fit <- smooth.spline(rx, goodness, df=4)
        list(goodness = predict(fit, rx)$y, direction = sign(lmean))
    } else {
        ...
\end{verbatim}

\section{Alternating logistic regression}
Chen \cite{Chen07} used a mixed logistic regression model to fit clinical
and genetic marker data.
\begin{equation}
  E(y) = \frac{e^{\eta_1 + \eta_2}}{1+e^{\eta_1 + \eta_2}}
\end{equation}
where $\eta_1 = X\beta$ is a standard linear predictor based on clinical
variables $X$ and $\eta_2$ is based on a tree model using a set of genetic
markers $Z$.
The solution was computed using alternate steps of an ordinary logistic
regression on $X$, treating $\eta_2$ as an offset, and an rpart fit for
$Z$, treating $\eta_1$ as an offset.

For many splitting rules the initialization function can take care of
offset terms once and for all by modifying the y vector, but this is not
the case for logistic regression.
In order to make the offset visible to the splitting function we create a
two-column ``response'' consisting of the original y vector in the first
column and the offset in the second.
<<>>=
loginit <- function(y, offset, parms, wt) {
    if (is.null(offset)) offset <- 0
    if (any(y != 0 & y != 1)) stop('response must be 0/1')
    sfun <- function(yval, dev, wt, ylevel, digits) {
        paste("events=", round(yval[,1]),
              ", coef= ", format(signif(yval[,2], digits)),
              ", deviance=", format(signif(dev, digits)),
              sep = '')
    }
    environment(sfun) <- .GlobalEnv
    list(y = cbind(y, offset), parms = 0, numresp = 2, numy = 2,
         summary = sfun)
}

logeval <- function(y, wt, parms) {
    tfit <- glm(y[,1] ~ offset(y[,2]), binomial, weight = wt)
    list(label = c(sum(y[,1]), tfit$coef), deviance = tfit$deviance)
}
@
The evaluation function returns the number of positive responses along with
the fitted coefficient.
The splitting function attempts every possible logistic regression.
A much faster version could almost certainly be written based on a score
test, however.
<<>>=
logsplit <- function(y, wt, x, parms, continuous) {
    if (continuous) {
        # continuous x variable: do all the logistic regressions
        n <- nrow(y)
        goodness <- double(n-1)
        direction <- goodness
        temp <- rep(0, n)
        for (i in 1:(n-1)) {
            temp[i] <- 1
            if (x[i] != x[i+1]) {
                tfit <- glm(y[,1] ~ temp + offset(y[,2]), binomial,
                            weight = wt)
                goodness[i] <- tfit$null.deviance - tfit$deviance
                direction[i] <- sign(tfit$coef[2])
            }
        }
    } else {
        # Categorical X variable
        # First, find out what order to put the categories in, which
        #  will be the order of the coefficients in this model
        tfit <- glm(y[,1] ~ factor(x) + offset(y[,2]) - 1, binomial,
                    weight = wt)
        ngrp <- length(tfit$coef)
        direction <- rank(rank(tfit$coef) + runif(ngrp, 0, 0.1)) #break ties
        # breaking ties -- if 2 groups have exactly the same p-hat, it
        #  does not matter which order I consider them in.  And the calling
        #  routine wants an ordering vector.
        #
        xx <- direction[match(x, sort(unique(x)))] #relabel from small to large
        goodness <- double(length(direction) - 1)
        for (i in 1:length(goodness)) {
            tfit <- glm(y[,1] ~ I(xx > i) + offset(y[,2]), binomial,
                        weight = wt)
            goodness[i] <- tfit$null.deviance - tfit$deviance
        }
    }
    list(goodness = goodness, direction = direction)
}
@

\section{Categorical splits}
A common question from users concerns the splitting rules for a categorical
predictor.
If a categorical predictor has 5 levels, why don't we need to evaluate all
$2^4 - 1 = 15$ possible ways of splitting it into two groups?
The result is based on a fairly simple observation.
Think of the choice of a split as two problems: selection of a pair of
predicted values $p1$ and $p2$, and selection of the observations into two
groups $g1$ and $g2$.
Then the problem is to find
\begin{equation*}
  \min_{g1, g2} \left[\min_{p1, p2} \sum_{i \in g1} E(y_i, p1) +
                 \sum_{i \in g2} E(y_i, p2) \right]
\end{equation*}
where $E$ is our prediction error function.
The inner optimization is trivial for most problems, so we normally don't
think about it.
For a continuous $y$ variable, for instance, the error is
$E(y,p) = (y-p)^2$ and $p1$ is the mean $y$ value of all subjects in group
1.

Now reverse the order of the above equation, and instead think of the
``outer loop'' of the optimization as first a selection of the two
predicted values $p1$ and $p2$, followed by finding the optimal grouping
for those predicted values.
When the predictor $x$ is categorical this reduces to checking whether
$\sum E(y, p1)$ or $\sum E(y, p2)$ is smaller, when summed over all the
observations having each particular level of $x$, and then assigning that
level to group $g1$ or $g2$ accordingly.
For squared error loss it is easy to see that this last step will assign a
subset to group $g1$ if mean($y$), over that subset, is closer to $p1$ than
to $p2$.
Thus, whatever the values of $p1$ and $p2$, the best split of the data
divides the groups according to low versus high values of $\overline y$.
If $x$ has $k$ levels, this shows that the only \emph{admissible} splits
are the $k-1$ groupings that first order the data according to the
within-level means: as $p1$ and $p2$ range over the entire real plane these
are the only inner loop solutions that will arise.
Knowing this, the computer code does not have to try all values of $p1$ and
$p2$; it uses the first form of the minimum but can restrict the outer loop
to this small set of cases.

Each different type of $y$ value and loss function requires its own version
of the paragraph just above.
For Poisson loss the optimal ordering turns out to be by the within-level
event rate, for instance.
A similar reduction can be shown for many loss functions.
For categorical $y$ and the Gini criterion the result is particularly
interesting.
Assume that the response $y$ has $m$ levels, and summarize each of the
distinct groups of observations corresponding to a categorical predictor
$x$ by its proportion vector: the proportion with $y=1$, the proportion
with $y=2$, etc.
If $x$ has $k$ levels, these proportions are $k$ points lying on the $m$
dimensional simplex, and the set of admissible splits consists of all
partitions of these points by a plane.
For the information criterion the points lie on a curved surface in $m$
space, but again the admissible set consists of all partitions by a plane.
When $m=2$ the points lie on a 1-dimensional curve and the problem reduces
to ordering the groups by mean($y$).
For $m>2$ the total number of admissible partitions of $k$ points can be
much smaller than $2^k$, and efficient methods for enumerating them have
been described \cite{Sleumer}, but this has not been incorporated into
rpart.

\begin{thebibliography}{9}
  \bibitem{Chen07} Chen J, Yu K, Hsing A and Therneau TM.
    \emph{A partially linear tree-based regression model for assessing
      complex joint gene-gene and gene-environment effects},
    Genet Epidemiol, 2007(3), 238--51.
  \bibitem{Sleumer} Nora Sleumer,
    \emph{Hyperplane arrangements: construction, visualization and
      applications}, PhD dissertation, Swiss Federal Institute of
    Technology, 1969.
\end{thebibliography}
\end{document}