[R] rpart package: why does predict.rpart require values for "unused" predictors?
Jason Roberts
jason.roberts at duke.edu
Thu Aug 2 16:44:01 CEST 2012
Jean,
Thanks for your quick reply and suggestions!
> In the help file for predict.rpart it says, "The predictors referred to in
> the right side of formula(object) must be present by name in newdata."
I was aware of that statement from the help file. I wondered about the
reason for that requirement. It would be convenient for the caller to not
have to provide values for unused predictors. I wondered whether the
requirement to provide them all was related to something I did not
understand, such as surrogate splits, or whether imposing it simply made
rpart itself easier to implement. (No offence intended to the authors for
taking a shortcut, if indeed they did.)
Are you pretty confident that your suggested workarounds will result in a
model that produces identical predictions? I only ask because I'm aware that
rpart has the ability to use surrogate variables in place of predictors that
are missing. But I do not fully understand how that capability works. I do
not know whether it is only used during fitting and not prediction.
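One way to probe this empirically is to compare predictions with and without
the variable in question. The following is only a rough sketch: it refits the
model from the printcp output shown further down, and the object names
(p_full, nd_no_hp, and so on) are made up for illustration. If surrogates are
consulted only when a primary split variable is missing, blanking out HP
should leave the predictions unchanged, while blanking out Disp. should still
yield predictions, though possibly different ones.

```r
library(rpart)

# Refit the model from this thread (formula taken from the printcp output).
model <- rpart(Mileage ~ Weight + Disp. + HP, data = car.test.frame)

# Baseline predictions with every predictor present.
p_full <- predict(model, newdata = car.test.frame)

# Same rows, but HP entirely missing. NA_real_ keeps the column numeric.
nd_no_hp <- car.test.frame
nd_no_hp$HP <- NA_real_
p_no_hp <- predict(model, newdata = nd_no_hp)

# Same rows, but the primary split variable Disp. entirely missing.
nd_no_disp <- car.test.frame
nd_no_disp$Disp. <- NA_real_
p_no_disp <- predict(model, newdata = nd_no_disp)

all.equal(p_full, p_no_hp)    # do predictions change when only HP is missing?
all.equal(p_full, p_no_disp)  # do they change when Disp. is missing?
```

Note that the HP column still has to be present by name in newdata, even
though every value in it is NA.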
Continuing my example, printcp reports the "Variables actually used in tree
construction":
> printcp(model)
Regression tree:
rpart(formula = Mileage ~ Weight + Disp. + HP, data = car.test.frame)
Variables actually used in tree construction:
[1] Disp. Weight
...
I can see in the source for printcp how those variables were obtained. But
when doing predictions, is it really safe to provide only those variables
and not HP, if I expect that some of their values could be missing? When I
call summary, I can see surrogate splits that reference the HP variable:
> summary(model)
Call:
rpart(formula = Mileage ~ Weight + Disp. + HP, data = car.test.frame)
n= 60
          CP nsplit rel error    xerror       xstd
1 0.62840234      0 1.0000000 1.0326274 0.17828576
2 0.12032318      1 0.3715977 0.5271278 0.08627909
3 0.04293478      2 0.2512745 0.4092689 0.07260291
4 0.01000000      3 0.2083397 0.3629544 0.06865150
Node number 1: 60 observations,    complexity param=0.6284023
  mean=24.58333, MSE=22.57639
  left son=2 (35 obs) right son=3 (25 obs)
  Primary splits:
      Disp.  < 134    to the right, improve=0.6284023, (0 missing)
      Weight < 2567.5 to the right, improve=0.5953491, (0 missing)
      HP     < 104.5  to the right, improve=0.4085043, (0 missing)
  Surrogate splits:
      Weight < 2747.5 to the right, agree=0.900, adj=0.76, (0 split)
      HP     < 104.5  to the right, agree=0.817, adj=0.56, (0 split)
...
Suppose the answer is:
1. The best predictions will be obtained by providing values for the
variables "actually used in tree construction" plus those used as
surrogates, and
2. If a variable is neither "actually used in tree construction" nor used as
a surrogate, it can safely be set to NA for the prediction.
In that case, do you know of an easy way to identify the variables used as
surrogates?
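For what it is worth, here is a sketch of one possible approach, which reads
the surrogate rows out of the fitted object's splits matrix. It assumes the
layout that summary appears to rely on: the rows of model$splits are grouped
by node in the order of model$frame, with one primary split, then ncompete
competitor splits, then nsurrogate surrogate splits per non-leaf node. The
helper name surrogate_vars is hypothetical and not part of rpart.

```r
library(rpart)

# Hypothetical helper (not part of rpart): names of the variables that
# appear in at least one surrogate split of a fitted rpart object.
surrogate_vars <- function(fit) {
  ff <- fit$frame
  if (is.null(fit$splits) || !any(ff$nsurrogate > 0)) return(character(0))
  is_leaf <- ff$var == "<leaf>"
  # Assumed layout of fit$splits, per non-leaf node in frame order:
  # 1 primary row, then ncompete competitor rows, then nsurrogate rows.
  per_node <- ff$ncompete + ff$nsurrogate + !is_leaf
  start <- cumsum(c(1, per_node))[seq_len(nrow(ff))]
  rows <- unlist(lapply(which(ff$nsurrogate > 0), function(i) {
    start[i] + ff$ncompete[i] + seq_len(ff$nsurrogate[i])
  }))
  unique(rownames(fit$splits)[rows])
}

model <- rpart(Mileage ~ Weight + Disp. + HP, data = car.test.frame)

# Variables used in primary splits (what printcp reports) ...
primary <- setdiff(unique(as.character(model$frame$var)), "<leaf>")
# ... plus those used as surrogates anywhere in the tree.
needed <- union(primary, surrogate_vars(model))
needed
```

Under these assumptions, any predictor in the formula that is not in needed
would be a candidate for being set to NA in newdata.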
Thanks again for your help, and sorry to write a book in response,
Jason