[R] cross-validation in rpart

Tue May 26 10:32:40 CEST 2009

Dear R users,
I know that cross-validation does not work in rpart with user-defined split 
functions. As Terry Therneau suggested, one can use the xpred.rpart function 
and then summarize the matrix of predicted values into a single 
"goodness" value.
I only need a confirmation. With, for example, xval=10, my understanding is 
that a single column of the matrix returned by xpred.rpart gives (for one 
cp level), for each of the 10 groups of observations, the values predicted 
by the tree grown on the other 9 groups. Am I right?
One more question: I want to compare the results for a tree, say A, fitted 
with the "class" method with those for a tree, say B, fitted with my custom 
functions (init, split and eval). To do so I should compare the cp tables of 
the two fitted rpart objects. For tree B I only have the "rel error" column, 
and I need to obtain the xerror and xstd columns as for tree A. To this end 
I need to know how these values are computed. I guess they depend on the 
xval value (in rpart.control), which is set to 10 by default. Does this mean 
that the observations are divided into 10 groups and, as before, xerror is 
computed by averaging the errors one gets in predicting the class of a group 
of observations with the tree grown on the other 9? Is xstd the standard 
deviation of these errors?

Thank you for your help
Paolo Radaelli

Paolo Radaelli
Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali
Facoltà di Economia
Università degli Studi di Milano-Bicocca
Via Bicocca degli Arcimboldi, 8
20126 Milano
Italy
e-mail paolo.radaelli at unimib.it
Tel +39 02 6448 3163
Fax +39 02 6448 3105
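
For the comparison of tree A and tree B asked about above, the following is a minimal sketch of how cross-validated error columns might be built by hand from xpred.rpart. It is not rpart's internal code. It assumes 0/1 loss and the default priors, so that the summary reduces to a misclassification rate, and it uses the kyphosis data shipped with rpart purely as a stand-in. The names xerr.raw, xerr.rel and xse.rel are made up for this example, and the standard error shown is only a rough one that treats the per-observation losses as independent.

```r
library(rpart)

set.seed(42)  # xpred.rpart assigns the cross-validation groups at random

# Tree A: an ordinary classification tree. xval = 0 switches off rpart's
# built-in cross-validation so that everything below is computed by hand.
fitA <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              method = "class", control = rpart.control(xval = 0))

# One row per observation, one column per cp value. Each entry is the class
# (as an integer code) predicted by a tree grown without that row's group.
yhat <- xpred.rpart(fitA, xval = 10)

y    <- as.integer(kyphosis$Kyphosis)  # same integer coding as yhat
loss <- (yhat != y) * 1                # 0/1 loss, one column per cp value

xerr.raw <- colMeans(loss)                  # cross-validated error rate
root.err <- 1 - max(table(y)) / length(y)   # error rate of the root node
xerr.rel <- xerr.raw / root.err             # same scale as "rel error"

# Assumption: a rough standard error treating the n losses as independent.
xse.rel <- apply(loss, 2, sd) / sqrt(length(y)) / root.err

round(cbind(xerr.raw, xerr.rel, xse.rel), 3)
```

The same summary could be applied to the xpred.rpart matrix of a tree fitted with custom init, split and eval functions, provided a 0/1 loss is a sensible measure for that splitting rule.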

begin included message
  begin included message
I'm having a problem with custom functions in rpart, and before I tear my 
hair out trying to fix it, I want to make sure it's actually a problem. It 
seems that when you write custom functions for rpart (init, split and eval), 
rpart no longer cross-validates the resulting tree to return errors. A simple 
test is to use the usersplits.R file to get a simple, custom rpart function, 
and then change fit1 and fit2 so that they both have xval set to 10. The 
problem is that the cptable for fit1 doesn't have xerror or xstd, despite the 
fact that the cross-validation is set to 10-fold.
I guess I just need confirmation that cross-validation doesn't work with 
custom functions, and if someone could explain to me why that is the case it 
would be greatly appreciated.

Thanks,
Sam Stewart

  end inclusion
  You are right, cross-validation does not happen automatically with 
user-written split functions. We can think of cross-validation as having two 
steps:

  1. Get the predicted values for each observation, when that obs (or a 
group) is left out of the data set. There is actually a vector of predicted 
values, one for each level of model complexity. This step can be done using 
xpred.rpart, which does work for user-defined splits. It returns a matrix 
with n rows (one per obs) and one column for each of the target cp values. 
Call this matrix "yhat".
  2. Summarize each column of the above matrix yhat into a single 
"goodness" value. For anova fitting, for instance, this is just 
colMeans((y-yhat)^2). For classification models it is a bit more complex: we 
have to add up the expected loss L(y, yhat) for each column using the loss 
matrix and the priors. The reason that rpart does not do this step for a 
user-written function is that rpart does not know what summary is 
appropriate. For some splitting rules, e.g. survival data split using a 
log-rank test, I'm not sure that *I* know what summation is 
appropriate.
   Terry Therneau

end included message
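
The two steps described in the included message can be sketched in R as follows. This is only an illustration: the built-in "anova" method and the mtcars data stand in for a user-written method and real data, and the names groups, cv.mse and cv.rel are made up for this example. Passing an explicit vector of group labels as xval is an optional choice here, so that the same folds could be reused when comparing two trees.

```r
library(rpart)

set.seed(1)  # the fold assignment below is random

# Built-in anova splitting stands in for a user-written method here.
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10, xval = 0))

# Explicit cross-validation groups: one integer label (1..10) per observation.
groups <- sample(rep(1:10, length.out = nrow(mtcars)))

# Step 1: cross-validated predictions, n rows by one column per cp value.
yhat <- xpred.rpart(fit, xval = groups)

# Step 2: summarize each column into a single "goodness" value.
y      <- mtcars$mpg
cv.mse <- colMeans((y - yhat)^2)          # the anova summary from step 2
cv.rel <- cv.mse / mean((y - mean(y))^2)  # rescaled so the root node has error 1

round(cbind(cv.mse, cv.rel), 3)
```

For a user-written method, only step 2 would need to change: colMeans((y - yhat)^2) would be replaced by whatever loss suits the splitting rule.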