[R] rpart package: why does predict.rpart require values for "unused" predictors?

Thu Aug 2 16:44:01 CEST 2012

Jean,

Thanks for your quick reply and suggestions! 

> In the help file for predict.rpart it says, "The predictors referred to in
> the right side of formula(object) must be present by name in newdata."

I was aware of that statement from the help file. I wondered about the
reason for that requirement. It would be convenient for the caller to not
have to provide values for unused predictors. I wondered whether the
requirement to provide them all was related to something I did not
understand, such as surrogate splits, or whether imposing it simply made
rpart itself easier to implement. (No offence intended to the authors for
taking a shortcut, if indeed they did.)

Are you pretty confident that your suggested workarounds will result in a
model that produces identical predictions? I only ask because I'm aware that
rpart can use surrogate variables in place of predictors that are missing,
but I do not fully understand how that capability works. In particular, I do
not know whether surrogates are used only during fitting or also during
prediction.
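
One way to check this empirically is sketched below. It refits the model
from my example and compares predictions on the training data with HP
blanked out, and then with Disp. blanked out. The names no_hp and no_disp
are only illustrative. The missing values are written as NA_real_ because,
as far as I can tell, predict.rpart checks column types and a plain
(logical) NA column would be rejected. This is a sketch under those
assumptions, not a statement of how rpart is meant to be used.

```r
library(rpart)

# Same model as in the example; car.test.frame ships with rpart
model <- rpart(Mileage ~ Weight + Disp. + HP, data = car.test.frame)

# Baseline: predictions with every predictor supplied
pred_full <- predict(model, newdata = car.test.frame)

# HP is never a primary split here, so blank it out (numeric NA, not logical)
no_hp <- car.test.frame
no_hp$HP <- NA_real_
pred_no_hp <- predict(model, newdata = no_hp)

# Disp. is a primary split; with it missing, prediction has to fall back
# on something else (surrogates, if they are used at prediction time)
no_disp <- car.test.frame
no_disp$Disp. <- NA_real_
pred_no_disp <- predict(model, newdata = no_disp)

# Compare each variant against the baseline
all.equal(pred_full, pred_no_hp)
sum(pred_full != pred_no_disp)
```

If the first comparison is TRUE, HP made no difference as long as Disp. and
Weight were present. The second line counts how many predictions change
once a primary split variable is missing, which is where HP as a surrogate
could matter.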

Continuing my example, I can see that printcp lists the "Variables actually
used in tree construction":

> printcp(model)

Regression tree:
rpart(formula = Mileage ~ Weight + Disp. + HP, data = car.test.frame)

Variables actually used in tree construction:
[1] Disp.  Weight

...

I can see in the source for printcp how those variables were obtained. But
when predicting, is it really safe to provide only those variables and omit
HP, if I expect that Disp. or Weight could themselves have missing values?
When I call summary, I can see surrogate splits that reference the HP
variable:

> summary(model)
Call:
rpart(formula = Mileage ~ Weight + Disp. + HP, data = car.test.frame)
  n= 60 

          CP nsplit rel error    xerror       xstd
1 0.62840234      0 1.0000000 1.0326274 0.17828576
2 0.12032318      1 0.3715977 0.5271278 0.08627909
3 0.04293478      2 0.2512745 0.4092689 0.07260291
4 0.01000000      3 0.2083397 0.3629544 0.06865150

Node number 1: 60 observations,    complexity param=0.6284023
  mean=24.58333, MSE=22.57639 
  left son=2 (35 obs) right son=3 (25 obs)
  Primary splits:
      Disp.  < 134    to the right, improve=0.6284023, (0 missing)
      Weight < 2567.5 to the right, improve=0.5953491, (0 missing)
      HP     < 104.5  to the right, improve=0.4085043, (0 missing)
  Surrogate splits:
      Weight < 2747.5 to the right, agree=0.900, adj=0.76, (0 split)
      HP     < 104.5  to the right, agree=0.817, adj=0.56, (0 split)

...

Assuming that the answer is:

1. The best predictions will be obtained by providing values for the
variables "actually used in tree construction" plus those used as
surrogates, and:

2. If a variable is neither "actually used in tree construction" nor as a
surrogate, it can be safely set to NA for the prediction.

Do you know of a way to easily identify the variables used as surrogates?
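
To make the question concrete, here is a rough sketch of what I have in
mind. It relies on my reading of the summary.rpart source: model$splits
appears to hold, for each internal node in the order of model$frame, one
primary split row, then ncompete competitor rows, then nsurrogate surrogate
rows. That layout is an assumption on my part, and the variable names
(internal, block_start, surrogate_rows) are only illustrative.

```r
library(rpart)

model <- rpart(Mileage ~ Weight + Disp. + HP, data = car.test.frame)

# Internal (non-leaf) nodes; leaves contribute no rows to model$splits
internal <- model$frame[model$frame$var != "<leaf>", ]

# Assumed layout per internal node: 1 primary + ncompete + nsurrogate rows
block_len <- 1 + internal$ncompete + internal$nsurrogate
block_start <- cumsum(c(1, block_len))[seq_along(block_len)]

# Row indices of the surrogate splits within each node's block
surrogate_rows <- unlist(lapply(seq_len(nrow(internal)), function(i) {
  first <- block_start[i] + 1 + internal$ncompete[i]
  first + seq_len(internal$nsurrogate[i]) - 1
}))

# Variables used as primary splits and as surrogates
primary_vars <- unique(as.character(internal$var))
surrogate_vars <- unique(rownames(model$splits)[surrogate_rows])

# Everything predict would plausibly need under assumption 1 above
needed_vars <- union(primary_vars, surrogate_vars)
needed_vars
```

For this model I would expect needed_vars to include HP, since summary
shows HP as a surrogate at node 1.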

Thanks again for your help, and sorry to write a book in response,

Jason