[R] specifying model terms when using predict
Marc Schwartz
marc_schwartz at comcast.net
Fri Jan 16 22:30:23 CET 2009
on 01/16/2009 02:20 PM VanHezewijk, Brian wrote:
> I've recently encountered an issue when trying to use the predict.glm
> function.
>
> I've gotten into the habit of using the dataframe$variablename method of
> specifying terms in my model statements. I thought this unambiguous
> notation would be acceptable in all situations but it seems models
> written this way are not accepted by the predict function. Perhaps
> others have encountered this problem as well.
<snip>
The bottom line is "don't do that". :-)
When the predict.*() functions look for the variable names, they use the
names as specified in the formula that was used in the initial creation
of the model object.
As per ?predict.glm:
Note
Variables are first looked for in newdata and then searched for in the
usual way (which will include the environment of the formula used in the
fit). A warning will be given if the variables found are not of the same
length as those in newdata if it was supplied.
As per your example, using:
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)
lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)
pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))
When predict.glm() tries to locate the variable "orig.df$x1" in the data
frame passed to 'newdata', it cannot be found. The correct name in the
model is "orig.df$x1", not "x1" as you used above.
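To see the name that the fitted object actually stores, one can inspect its terms and coefficients. This is only a small sketch built on the example above; 'term_names' and 'coef_names' are names I have made up for illustration.

```r
# Rebuild the example from the original post
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)
lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)

# The predictor is recorded under the literal text used in the formula
term_names <- attr(terms(lm1), "term.labels")
coef_names <- names(coef(lm1))
print(term_names)
print(coef_names)
```

Both show "orig.df$x1" rather than "x1", which is why a 'newdata' column named "x1" is never matched.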
Thus, since it cannot find that variable in 'newdata', it begins to look
elsewhere for a variable called "orig.df$x1". Guess what? It finds it
in the global environment as a column of the original data frame 'orig.df'.
Since that column has a length of 100 and the data frame that you passed
to newdata only has 50 rows, you get a warning, and the predictions are
computed from the original 100 rows rather than from 'newdata':
Warning message:
'newdata' had 50 rows but variable(s) found have 100 rows
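A quick way to confirm that 'newdata' was effectively ignored is to check how many predictions come back. The sketch below reuses the example from the original post; 'new.df' and 'n_pred' are illustrative names, and the warning is suppressed only to keep the output short.

```r
# Same model as in the original post
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)
lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)

# 50 new rows, but the model looks for "orig.df$x1", not "x1"
new.df <- data.frame(x1 = 101:150)
pred1 <- suppressWarnings(predict(lm1, newdata = new.df))

# One prediction per row of orig.df, not per row of new.df
n_pred <- length(pred1)
print(n_pred)
```

If the result has as many elements as the original data rather than as 'newdata', the predictions are just the fitted values for the original rows.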
There is a "method" to the madness, and a good reason why the modeling
functions and others that take a formula argument also have a 'data'
argument to specify the location of the variables to be used: write the
formula with the bare column names (y1 ~ x1) and let 'data' supply them.
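For comparison, here is a minimal sketch of the same example with bare column names in the formula. The names 'lm2', 'new.df' and 'pred2' are mine and not from the original post.

```r
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)

# Bare column names in the formula; 'data' says where to find them
lm2 <- glm(y1 ~ x1, data = orig.df)

# 'newdata' only needs a column with the same name as the predictor
new.df <- data.frame(x1 = 101:150)
pred2 <- predict(lm2, newdata = new.df)
print(length(pred2))
```

Here predict() finds "x1" in 'newdata' directly, so it returns one prediction per new row and no warning is raised.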
HTH,
Marc Schwartz