Hi:
The problem arises because the names of the explanatory variables in the
newdata data frame used in predict() have to match those in the fitted
model object. Interestingly, using a matrix for the right hand side of the
model formula in lm() creates problems for predict().
Using your code,
> x <- matrix(rnorm(30), ncol =2)
> y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
> m0 <- lm(y ~ x)
> m0
...
Coefficients:
(Intercept)           x1           x2
   0.590281     4.868230    -0.007012
> new_x <- matrix(rnorm(2), ncol =2)
> new_x.d <- data.frame(new_x)
> new_x.d
X1 X2
1 0.1225315 0.8099963
The coefficients in the model are labelled x1 and x2, whereas the columns
of the data frame you want to use in predict() are named X1 and X2,
creating a name mismatch.
The apparent 'solution' is to change the names in new_x.d to lower case, but
interesting things happen...
> names(new_x.d) <- c('x1', 'x2')
> predict(m0, new_x.d)
1 2 3 4 5 6 7
1.1734885 -5.5551829 3.5652911 7.9607333 -9.4959770 4.3378850 -3.5098720
8 9 10 11 12 13 14
-2.1571867 3.8502343 5.8451436 -6.7490334 0.2203290 -4.2810391 0.4988267
15
6.8596084
Warning message:
'newdata' had 1 rows but variable(s) found have 15 rows
> new_x.d
x1 x2
1 0.1225315 0.8099963
Even though the names (apparently) match now, predict() ignores new_x.d and
returns the fitted values for the original input *matrix* (hence the
warning), and that turns out to matter...
Let's go back to x and put some column names on it, refit the model and try
predict() again:
> colnames(x) <- c('x1', 'x2')
> class(x)
[1] "matrix"
> m1 <- lm(y ~ x)
> predict(m1, new_x.d)
# Same as above...
Although the variable names in the input matrix and new_x.d now match,
predict()
still 'misbehaves'. To see why,
> m1
...
Coefficients:
(Intercept)          xx1          xx2
   0.590281     4.868230    -0.007012
lm() builds each coefficient label by pasting the name of the matrix, x,
in front of the column name, giving xx1 and xx2 and thus causing another
mismatch with the variable names passed to predict().
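A small sketch of that labelling rule, in case it helps. This is my own
illustration rather than part of the session above; the seed and the object
names fit_plain and fit_named are arbitrary choices.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- matrix(rnorm(30), ncol = 2)
y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)

fit_plain <- lm(y ~ x)             # matrix without column names
colnames(x) <- c("x1", "x2")
fit_named <- lm(y ~ x)             # same matrix, now with column names

# Labels are the term name "x" followed by the column index or column name
names(coef(fit_plain))
names(coef(fit_named))
```

The labels change with the column names, but in both fits the formula
contains only the single term x.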
Now, combine x and y into a data frame, refit the model and try predict()
again:
> xx <- data.frame(y, x)
# verify that it's a data frame with the right variable names...
> str(xx)
'data.frame': 15 obs. of 3 variables:
$ y : num 0.236 -6.069 2.687 7.323 -10.028 ...
$ x1: num 0.12 -1.261 0.611 1.514 -2.069 ...
$ x2: num 0.367 1.192 -0.102 0.117 1.66 ...
# Refit the model and run predict() again:
> m2 <- lm(y ~ ., data = xx)
> predict(m2, new_x.d)
1
1.181113
Now it works.
Evidently, inputting a matrix for the right hand side of the model formula
in lm() creates
problems for predict(). According to the help page, the first argument of
predict.lm() is
an object of class lm, whereas the second argument is a data frame. As it
turns out, the
key phrase needed to understand what's going on is the following:
predict.lm produces predicted values, obtained by evaluating the regression
function in the frame newdata
(which defaults to model.frame(object)).
The names of the model.frame() objects in the three models are:
> names(model.frame(m0)) # x is a matrix, no colnames
[1] "y" "x"
> names(model.frame(m1)) # x is a matrix with colnames
[1] "y" "x"
> names(model.frame(m2)) # x1 and x2 are variables in a data frame
[1] "y" "x1" "x2"
Notice that these are the same as the objects given in the respective model
formulas.
Moreover,
> head(model.frame(m0), 1)
y x.1 x.2
1 0.2355153 0.1203279 0.3674401
> head(model.frame(m1), 1)
y x.x1 x.x2
1 0.2355153 0.1203279 0.3674401
> head(model.frame(m2), 1)
y x1 x2
1 0.2355153 0.1203279 0.3674401
Now, one can see that the names assigned to the covariates by model.frame()
when x is a
matrix depend on the column names assigned to the input matrix. Does this
help?
Let's copy new_x.d to another data frame object and rename the variables for
prediction with m0:
> new0 <- new_x.d
> names(new0) <- c('x.1', 'x.2')
> predict(m0, new0)
1 2 3 4 5 6 7
1.1734885 -5.5551829 3.5652911 7.9607333 -9.4959770 4.3378850 -3.5098720
8 9 10 11 12 13 14
-2.1571867 3.8502343 5.8451436 -6.7490334 0.2203290 -4.2810391 0.4988267
15
6.8596084
Warning message:
'newdata' had 1 rows but variable(s) found have 15 rows
> new0
x.1 x.2
1 0.1225315 0.8099963
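That doesn't help, either. predict() is not recognizing x.1 and x.2 as
variable names in the model frame of m0, which is consistent with
names(model.frame(m0)): the frame holds a single variable called x, and
the x.1 and x.2 labels only appear when it is printed.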
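One way to check this is to look inside the model frame directly. The
sketch below is my own addition (arbitrary seed, hypothetical object name
mf) and only inspects the structure of the frame.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
x <- matrix(rnorm(30), ncol = 2)
y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
m0 <- lm(y ~ x)

mf <- model.frame(m0)
names(mf)                # the frame has two variables: y and x
is.matrix(mf$x)          # the variable x is itself a matrix
dim(mf$x)                # 15 rows, 2 columns
"x.1" %in% names(mf)     # x.1 is a print label, not a variable name
```

So predict() needs to find something called x in newdata; when it cannot,
it appears to fall back on the x in the workspace, which is the original
15-row matrix.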
The moral seems to be: to use predict() predictably, make sure that the
inputs to lm() are
in a data frame. One experiences far fewer headaches that way.
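For what it's worth, here is a sketch of two routes that should give a
one-row prediction. It is my own illustration with an arbitrary seed and
made-up object names (dat, fit_df, fit_mat, p1, p2). Route 2 relies on my
understanding that predict() will accept a newdata data frame whose single
column x is a matrix protected by I(); please treat that as an assumption
and check it on your own setup.

```r
set.seed(1)                                  # arbitrary seed
x <- matrix(rnorm(30), ncol = 2)
colnames(x) <- c("x1", "x2")
y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
new_x <- matrix(rnorm(2), ncol = 2)
colnames(new_x) <- colnames(x)

# Route 1: fit from a data frame, predict from a data frame with the
# same column names. "y ~ ." picks up however many regressors there are.
dat    <- data.frame(y, x)
fit_df <- lm(y ~ ., data = dat)
p1     <- predict(fit_df, data.frame(new_x))

# Route 2 (assumption, see above): keep lm(y ~ x) and hand predict() a
# data frame with ONE column named x that holds the new matrix.
fit_mat <- lm(y ~ x)
p2      <- predict(fit_mat, data.frame(x = I(new_x)))

p1
p2
```

Route 1 keeps the automation the original post was after, since the
formula never has to list the regressors by name.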
A clearer, pithier explanation of why this phenomenon occurs would be
welcome, too :)
HTH,
Dennis
On Wed, May 5, 2010 at 3:16 AM, Paolo Agnolucci wrote:
> Hi everyone,
>
> this should be pretty basic but I need to ask for help as I got stuck.
>
> I am running simple linear regression models in R with k regressors,
> where k > 1. In order to automate my code I packed all the regressors in
> a matrix X so that lm(y~X) will always produce the results I want
> regardless of the variables in X. I am new to R but I found this advice
> somewhere so I guess it is relatively standard practice. This works very
> well until I need to forecast using the estimated model.
>
> I cannot pass a matrix to predict - when I pass a data frame I get the
> fitted values, which leads me to think that R doesn't see the data.frame
> I pass to predict
>
> Thanks in advance,
>
> Paolo
>
>
>
> # REPRODUCIBLE CODE
> x <- matrix(rnorm(30), ncol =2)
> y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
> new_x <- matrix(rnorm(2), ncol =2)
> new_x.d <- data.frame(new_x)
>
> # fitted values
> predict(lm(y ~ x))
>
> # same as fitted values
> predict(lm(y ~ x), new_x.d)
>
> # error
> predict(lm(y ~ x), new_x)
>
>
