[R] General query regarding scoring new observations
Rolf Turner
r.turner at auckland.ac.nz
Thu Feb 12 02:29:43 CET 2009
On 12/02/2009, at 2:02 PM, Lars Bishop wrote:
> Hi,
> I was wondering if I can have some advice on the following problem.
>
> Let's say that I have a problem in which I want to predict a binary outcome
> and I use logistic regression for that purpose. In addition, suppose that my
> model includes predictors that will not be used in scoring new observations
> but must be used during model training to absorb certain effects that could
> bias the parameter estimates of the other variables.
>
> Because one needs to have the same predictors in model development and
> scoring, how is this problem usually overcome in practice? I could exclude
> the variables that will not be available during scoring, but that will
> bias the estimates for the other variables.
Surely if you only have x_1, x_2, and x_3 available for prediction,
then you should ``train'' using only x_1, x_2, and x_3.
If you also have x_4 and x_5 available for training then not using them
will ``bias'' the coefficients of the other three predictors, but will
give the best (in some sense) values of these coefficients to use when
x_4 and x_5 are not available.
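A minimal sketch of that approach, using simulated data. The data frame
`train`, the coefficient values and the object names `fit3` and `p3` are all
made up for illustration and are not from the original question:

```r
set.seed(1)
n <- 500

# Simulated training data: five predictors, all of which affect the outcome.
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                    x4 = rnorm(n), x5 = rnorm(n))
eta <- with(train, -0.5 + x1 - 0.5 * x2 + 0.25 * x3 + 0.8 * x4 - 0.6 * x5)
train$y <- rbinom(n, 1, plogis(eta))

# Train using only the predictors that will be available when scoring.
fit3 <- glm(y ~ x1 + x2 + x3, family = binomial, data = train)

# New observations carry only x1, x2 and x3.
newdata <- data.frame(x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))
p3 <- predict(fit3, newdata = newdata, type = "response")
print(p3)
```

The coefficients of `fit3` will generally differ from the ones used to
simulate the data, but they are the ones estimated for the situation in which
x_4 and x_5 are unknown.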
Note that, at prediction time, leaving x_4 and x_5 out of the linear
predictor is equivalent to setting them equal to 0, so if you *insist* on
fitting the model with x_1, ..., x_5 and then predicting with x_1, ..., x_3
you can accomplish this by setting x_4 and x_5 equal to 0 in your
``newdata'' data frame.
This seems to me to be highly inadvisable however.
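For completeness, here is a sketch of what the zero-filling trick does
mechanically, again on simulated data with made-up names (`train`, `fit5`,
`newdata0`). It only checks that the prediction on the link scale reduces to
the intercept plus the x_1, ..., x_3 terms of the full fit; it says nothing
about whether those predictions are any good:

```r
set.seed(1)
n <- 500

# Simulated training data with all five predictors available.
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                    x4 = rnorm(n), x5 = rnorm(n))
eta <- with(train, -0.5 + x1 - 0.5 * x2 + 0.25 * x3 + 0.8 * x4 - 0.6 * x5)
train$y <- rbinom(n, 1, plogis(eta))

# Fit the full model on x1, ..., x5.
fit5 <- glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial, data = train)

# New observations have only x1, x2, x3; fill x4 and x5 with zeros.
newdata  <- data.frame(x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))
newdata0 <- transform(newdata, x4 = 0, x5 = 0)
eta0 <- predict(fit5, newdata = newdata0, type = "link")

# The same linear predictor computed by hand, with the x4 and x5 terms dropped.
b <- coef(fit5)
eta_manual <- b[["(Intercept)"]] + b[["x1"]] * newdata$x1 +
  b[["x2"]] * newdata$x2 + b[["x3"]] * newdata$x3

same <- isTRUE(all.equal(unname(eta0), eta_manual))
print(same)
```

If x_4 and x_5 are not centred at 0, a value of 0 may not even be a typical
value for them, which is one more reason to be wary of scoring this way.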
cheers,
Rolf Turner