[R] General query regarding scoring new observations

Rolf Turner r.turner at auckland.ac.nz
Thu Feb 12 02:29:43 CET 2009


On 12/02/2009, at 2:02 PM, Lars Bishop wrote:

> Hi,
> I was wondering if I can have some advice on the following problem.
>
> Let's say that I have a problem in which I want to predict a binary
> outcome and I use logistic regression for that purpose. In addition,
> suppose that my model includes predictors that will not be used in
> scoring new observations but must be used during model training to
> absorb certain effects that could bias the parameter estimates of the
> other variables.
>
> Because one needs to have the same predictors in model development and
> scoring, how is it usually done in practice to overcome this problem? I
> could exclude the variables that will not be available during scoring,
> but that will bias the estimates for the other variables.

Surely if you only have x_1, x_2, and x_3 available for prediction,
then you should ``train'' using only x_1, x_2, and x_3.

If you also have x_4 and x_5 available for training then not using them
will ``bias'' the coefficients of the other three predictors, but will
give the best (in some sense) values of these coefficients to use when
x_4 and x_5 are not available.

Note that not using x_4 and x_5 at prediction time is equivalent to
setting them equal to 0, so if you *insist* on fitting the model with
x_1, ..., x_5 and then predicting with x_1, ..., x_3 you can accomplish
this by setting x_4 and x_5 equal to 0 in your ``newdata'' data frame.

This seems to me to be highly inadvisable, however.
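
A minimal sketch of the two approaches, for concreteness. The data are
simulated, and the sample size, coefficient values and object names (dat,
fit3, fit5, new) are made up purely for illustration; only glm() and
predict() with a ``newdata'' data frame come from the discussion above.

```r
# Simulated data: every number and name here is an illustrative assumption.
set.seed(42)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
x4 <- 0.7 * x1 + rnorm(n)   # correlated with x1, so omitting it shifts the x1 coefficient
x5 <- rnorm(n)
eta <- -0.5 + x1 - 0.5 * x2 + 0.25 * x3 + 0.8 * x4 + 0.6 * x5
dat <- data.frame(y = rbinom(n, 1, plogis(eta)), x1, x2, x3, x4, x5)

# Train using only the predictors that will be available when scoring.
fit3 <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)

# The alternative: train on all five predictors ...
fit5 <- glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial, data = dat)

# ... and score new observations with x4 and x5 set to 0 in newdata.
new <- data.frame(x1 = c(0, 1), x2 = c(0, -1), x3 = c(0, 0.5), x4 = 0, x5 = 0)
p3 <- predict(fit3, newdata = new, type = "response")
p5 <- predict(fit5, newdata = new, type = "response")

# Setting x4 = x5 = 0 simply drops their terms from the linear predictor.
b <- coef(fit5)
manual <- plogis(b[["(Intercept)"]] + b[["x1"]] * new$x1 +
                 b[["x2"]] * new$x2 + b[["x3"]] * new$x3)
```

Comparing coef(fit3) with coef(fit5) shows how the coefficients of x_1,
x_2 and x_3 shift when x_4 and x_5 are left out of the fit; p5 agrees
with ``manual'', i.e. the five-variable fit with its last two terms
simply deleted.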

	cheers,

		Rolf Turner



