[R] specifying model terms when using predict
David Winsemius
dwinsemius at comcast.net
Fri Jan 16 22:44:14 CET 2009
On Jan 16, 2009, at 4:30 PM, Marc Schwartz wrote:
> on 01/16/2009 02:20 PM VanHezewijk, Brian wrote:
>> I've recently encountered an issue when trying to use the predict.glm
>> function.
>>
>> I've gotten into the habit of using the dataframe$variablename method
>> of specifying terms in my model statements. I thought this unambiguous
>> notation would be acceptable in all situations, but it seems models
>> written this way are not accepted by the predict function. Perhaps
>> others have encountered this problem as well.
>
> <snip>
>
> The bottom line is "don't do that". :-)
>
> When the predict.*() functions look for the variable names, they use
> the names as specified in the formula that was used in the initial
> creation of the model object.
>
> As per ?predict.glm:
>
> Note
>
> Variables are first looked for in newdata and then searched for in the
> usual way (which will include the environment of the formula used in
> the fit). A warning will be given if the variables found are not of the
> same length as those in newdata if it was supplied.
>
>
> As per your example, using:
>
> x <- 1:100
>
> y <- 2 * x
>
> orig.df <- data.frame(x1 = x, y1 = y)
>
> lm1 <- glm(orig.df$y1 ~ orig.df$x1, data = orig.df)
>
> pred1 <- predict(lm1, newdata = data.frame(x1 = 101:150))
>
>
> When predict.glm() tries to locate the variable "orig.df$x1" in the
> data frame passed to 'newdata', it cannot be found. The correct name in
> the model is "orig.df$x1", not "x1" as you used above.
>
> Thus, since it cannot find that variable in 'newdata', it begins to
> look elsewhere for a variable called "orig.df$x1". Guess what? It finds
> it in the global environment as a column of the original data frame
> 'orig.df'.
>
> Since that column has a length of 100 and the data frame that you
> passed to newdata only has 50, you get an error.
>
> Warning message:
>
> 'newdata' had 50 rows but variable(s) found have 100 rows
Marc,
Knowing your skill level, which far exceeds mine, you probably do know
that it was not an error, only a warning. The assignment to pred1
proceeded (as you described), just not with the result that VanHezewijk
expected: "newdata" was ignored, orig.df$x1 was found in the global
environment, and no extrapolation occurred. pred1 holds the 100 fitted
values for the original data.
--
David
>
> There is a "method" to the madness and good reason why the modeling
> functions and others that take a formula argument also have a 'data'
> argument to specify the location of the variables to be used.
>
> HTH,
>
> Marc Schwartz
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.