[R] specifying model terms when using predict

David Winsemius dwinsemius at comcast.net
Fri Jan 16 22:44:14 CET 2009


On Jan 16, 2009, at 4:30 PM, Marc Schwartz wrote:

> on 01/16/2009 02:20 PM VanHezewijk, Brian wrote:
>> I've recently encountered an issue when trying to use the predict.glm
>> function.
>>
>>
>>
>> I've gotten into the habit of using the dataframe$variablename method of
>> specifying terms in my model statements.  I thought this unambiguous
>> notation would be acceptable in all situations but it seems models
>> written this way are not accepted by the predict function.  Perhaps
>> others have encountered this problem as well.
>
> <snip>
>
> The bottom line is "don't do that".  :-)
>
> When the predict.*() functions look for the variable names, they use the
> names as specified in the formula that was used in the initial creation
> of the model object.
>
> As per ?predict.glm:
>
> Note
>
> Variables are first looked for in newdata and then searched for in the
> usual way (which will include the environment of the formula used in the
> fit). A warning will be given if the variables found are not of the same
> length as those in newdata if it was supplied.
>
>
> As per your example, using:
>
> x <- 1:100
>
> y <- 2 * x
>
> orig.df <- data.frame(x1 = x, y1 = y)
>
> lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)
>
> pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))
>
>
> When predict.glm() tries to locate the variable "orig.df$x1" in the data
> frame passed to 'newdata', it cannot be found. The correct name in the
> model is "orig.df$x1", not "x1" as you used above.
>
> Thus, since it cannot find that variable in 'newdata', it begins to look
> elsewhere for a variable called "orig.df$x1". Guess what?  It finds it
> in the global environment as a column of the original data frame 'orig.df'.
>
> Since that column has a length of 100 and the data frame that you passed
> to newdata only has 50, you get an error.
>
> Warning message:
>
> 'newdata' had 50 rows but variable(s) found have 100 rows

Marc,

Knowing your skill level, which far exceeds mine, you probably do know
that it was not an error, only a warning, and that the assignment to pred1
proceeded (as you described), just not the assignment that VanHezewijk
expected. "newdata" was ignored, orig.df$x1 was found in its place, and
no extrapolation occurred: pred1 holds the 100 fitted values for the
original data rather than 50 predictions for x1 = 101:150.
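
For anyone following along, here is a minimal sketch of the difference,
assuming the objects from your example. The names new.df, bad.fit,
bad.pred, good.fit and good.pred are made up here for illustration and
are not from the original post.

```r
# Data from the example in this thread
x <- 1:100
y <- 2 * x
orig.df <- data.frame(x1 = x, y1 = y)

# Hypothetical name for the data we want predictions for
new.df <- data.frame(x1 = 101:150)

# Terms written as dataframe$variable: the model term is "orig.df$x1",
# which is not a column of new.df, so predict() falls back to orig.df.
bad.fit  <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)
bad.pred <- predict(bad.fit, newdata = new.df)  # warns; newdata is not used
length(bad.pred)                                # 100

# Terms written as bare column names, with the data frame given via 'data':
# predict() finds "x1" in new.df and extrapolates as intended.
good.fit  <- glm(y1 ~ x1, data = orig.df)
good.pred <- predict(good.fit, newdata = new.df)
length(good.pred)                               # 50
```

The only change between the two fits is the formula; the 'data' argument
is the same in both.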

-- 
David

>
>
>
> There is a "method" to the madness and good reason why the modeling
> functions and others that take a formula argument also have a 'data'
> argument to specify the location of the variables to be used.
>
> HTH,
>
> Marc Schwartz
>