[R] specifying model terms when using predict

Fri Jan 16 23:01:25 CET 2009

on 01/16/2009 03:44 PM David Winsemius wrote:
> 
> On Jan 16, 2009, at 4:30 PM, Marc Schwartz wrote:
> 
>> on 01/16/2009 02:20 PM VanHezewijk, Brian wrote:
>>> I've recently encountered an issue when trying to use the predict.glm
>>> function.
>>>
>>> I've gotten into the habit of using the dataframe$variablename method of
>>> specifying terms in my model statements.  I thought this unambiguous
>>> notation would be acceptable in all situations but it seems models
>>> written this way are not accepted by the predict function.  Perhaps
>>> others have encountered this problem as well.
>>
>> <snip>
>>
>> The bottom line is "don't do that".  :-)
>>
>> When the predict.*() functions look for the variable names, they use the
>> names as specified in the formula that was used in the initial creation
>> of the model object.
>>
>> As per ?predict.glm:
>>
>> Note
>>
>> Variables are first looked for in newdata and then searched for in the
>> usual way (which will include the environment of the formula used in the
>> fit). A warning will be given if the variables found are not of the same
>> length as those in newdata if it was supplied.
>>
>>
>> As per your example, using:
>>
>> x <- 1:100
>>
>> y <- 2 * x
>>
>> orig.df <- data.frame(x1 = x, y1 = y)
>>
>> lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)
>>
>> pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))
>>
>>
>> When predict.glm() tries to locate the variable "orig.df$x1" in the data
>> frame passed to 'newdata', it cannot be found. The correct name in the
>> model is "orig.df$x1", not "x1" as you used above.
>>
>> Thus, since it cannot find that variable in 'newdata', it begins to look
>> elsewhere for a variable called "orig.df$x1". Guess what?  It finds it
>> in the global environment as a column of the original data frame 'orig.df'.
>>
>> Since that column has a length of 100 and the data frame that you passed
>> to newdata only has 50, you get an error.
>>
>> Warning message:
>>
>> 'newdata' had 50 rows but variable(s) found have 100 rows
> 
> Marc,
> 
> Knowing your skill level, which far exceeds mine, you probably do know
> that it was not an error, only a warning, and the assignment to pred1
> proceeded (as you described), just not the assignment that VanHezewijk
> expected. "newdata" was ignored, orig.df$x1 was found and no
> extrapolation occurred.

David,

Excellent correction.

For additional clarification:

> str(fitted(lm1))
 Named num [1:100] 2 4 6 8 10 ...
 - attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...

> str(pred1)
 Named num [1:100] 2 4 6 8 10 ...
 - attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...

> all(fitted(lm1) == pred1)
[1] TRUE

which reinforces David's comment that the values in 'pred1' are the same
100 fitted values from the original model, covering x values 1:100.

The description of 'newdata' in ?predict.glm points the same way:

optionally, a data frame in which to look for variables with which to
predict. If omitted, the fitted linear predictors are used.
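
For contrast, here is a minimal sketch of the same fit written with bare
column names in the formula, so that predict() can match "x1" in
'newdata'. The names lm2 and pred2 are mine, chosen for illustration, and
are not from the original post:

```r
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)

# Bare column names in the formula; 'data' tells glm() where to find them
lm2 <- glm(y1 ~ x1, data = orig.df)

# 'newdata' supplies a column named "x1", which matches the model term
pred2 <- predict(lm2, newdata = data.frame(x1 = 101:150))

length(pred2)
str(pred2)
```

Here pred2 should hold 50 values, one per row of 'newdata', rather than
the 100 fitted values from the original data.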

Note that I can get away with using "==" above because the fitted values
are all whole numbers here. Had they been non-integer floating-point
values, I would have had to use all.equal() or another tolerance-based
approach.
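
As a small illustration of that last point, using only base R: exact
comparison with "==" can fail for non-integer doubles, whereas all.equal()
compares within a tolerance. The noisy data and the names df2, fit2 and
same below are made up for the example:

```r
# Classic floating-point case: the two sides differ in the last bits
(0.1 + 0.2) == 0.3                 # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE

# Hypothetical noisy data, so the fitted values are not whole numbers
set.seed(1)
df2 <- data.frame(x1 = 1:100)
df2$y1 <- 2 * df2$x1 + rnorm(100)
fit2 <- glm(y1 ~ x1, data = df2)

# Tolerance-based comparison of fitted values and predictions on the same rows
same <- isTRUE(all.equal(fitted(fit2), predict(fit2, newdata = df2)))
same
```

Wrapping all.equal() in isTRUE() is the usual idiom, since all.equal()
returns a character description of the differences rather than FALSE when
the objects do not match.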

Thanks David for pointing out the distinction and my own error.

Marc