[R] predict.lm with new regressor names

Thu Dec 16 09:07:26 CET 2010

Thanks very much Dennis and Michael. That makes sense. I had some very
sloppy code which did the equivalent of

y<-rnorm(100)
x<-list()
x[[1]]<-rnorm(100)
x[[2]]<-rnorm(100)
lm.yx<-lm(y~x[[1]]+x[[2]])

To predict from lm.yx, I was trying to feed predict.lm a data.frame
with columns called x[[1]] and x[[2]]. Of course, in retrospect, that
makes no sense. Following on Michael's suggestion, I did

predict(lm.yx, newdata=list(x=list(rnorm(100), rnorm(100))))

which works as expected. I agree, messing with the lm object is
definitely a bad idea. I was just having a really hard time feeding
predict.lm with the correct objects and getting desperate.

Thanks again!

Best,
Anirban

--
Anirban Mukherjee | Assistant Professor, Marketing
LKCSB, Singapore Management University
5056 School of Business, 50 Stamford Road
Singapore 178899 | +65-6828-1932

On Thu, Dec 16, 2010 at 3:22 PM, Dennis Murphy <djmuser at gmail.com> wrote:
> Hi:
>
> On Wed, Dec 15, 2010 at 10:05 PM, Anirban Mukherjee <am253 at cornell.edu>
> wrote:
>>
>> Hi all,
>>
>> Suppose:
>>
>> y<-rnorm(100)
>> x1<-rnorm(100)
>> lm.yx<-lm(y~x1)
>>
>> To predict from a new data source, one can use:
>>
>> # works as expected
>> dum<-data.frame(x1=rnorm(200))
>> predict(lm.yx, newdata=dum)
>
> Yup.
>>
>> Suppose lm.yx has been run and we have the lm object. And we have a
>> dataframe that has columns that don't correspond by name to the
>> original regressors. I very! naively assumed that doing this (below)
>> would work. It does not.
>>
>> # does not work
>> lm.yx$coefficients<-c("Intercept", "n.x1")
>> dum2<-data.frame(Int=rep(1,200), n.x1=rnorm(200))
>> predict(lm.yx, newdata=dum2)
>>
> No, it does not. Compare the names of lm.yx$coefficients with those of
> model.matrix(lm.yx). If they're different, you have a problem. Here's a toy
> example to illustrate it:
>
> x1 <- seq(from = 10, to = 200, by = 10)
> x2 <- rpois(20, 15)
> y <- 2 + 0.7 * x1 + 0.5 * x2 + rnorm(20, 0, 0.4)
> m <- lm(y ~ x1 + x2)
> u <- data.frame(x1 = c(175, 210), x2 = c(20, 25))
> predict(m, newdata = u)
>        1        2
> 134.7912 161.7939
> coef(m)
> (Intercept)          x1          x2
>   2.1925618   0.7025116   0.4829565
>
> # No problem thus far. Let's change the names on the prediction data frame:
>
> names(u) <- c('x3', 'x4')
> predict(m, newdata = u)
>         1         2         3         4         5         6
> 7         8
>  15.49611  21.55532  31.47817  37.53737  44.07953  51.58761  58.61272
> 66.60375
>         9        10        11        12        13        14        15
> 16
>  70.73113  79.20511  86.23023  92.28943  99.31455 108.27149 115.29661
> 117.49216
>        17        18        19        20
> 130.31275 136.85491 142.91412 152.83697
> Warning message:
> 'newdata' had 2 rows but variable(s) found have 20 rows
>
> # Change the names on the coefficients
> names(m$coefficients)[c(2, 3)] <- c('x3', 'x4')
> coef(m)
> (Intercept)          x3          x4
>   2.1925618   0.7025116   0.4829565
> predict(m, newdata = u)       # same as above
>
> # \hat{y} = X \hat{\beta}, so let's look at the model matrix:
> names(model.matrix(m))
> NULL                                         # Oops...it's a matrix!
> colnames(model.matrix(m))
> [1] "(Intercept)" "x1"          "x2"
>
> # ...this is after you redefined the names of the model coefficients.
> # Now try to fix the names of the model matrix...
> colnames(model.matrix(m))[c(2, 3)] <- c('x3', 'x4')
> Error in colnames(model.matrix(m))[c(2, 3)] <- c("x3", "x4") :
>   could not find function "model.matrix<-"
>
> There's a way to do this with replacement functions at a
> fundamental level, but it takes some work. Even if you get that,
> there will still be potential issues - e.g., if you want to compute
> confidence/prediction intervals.
>
> lm() returns a list object with these names:
> names(m)
>  [1] "coefficients"  "residuals"     "effects"       "rank"
>  [5] "fitted.values" "assign"        "qr"            "df.residual"
>  [9] "xlevels"       "call"          "terms"         "model"
>
> Now look at the names of the following:
>
> m$effects
> m$xlevels
> m$call
> m$terms
> m$model
> m$qr
>
> Do you really want to go through the model object and change all of that for
> a few variable names?
>
>> I know that a simple alternative is to do:
>>
>> # because we messed around with the lm object above, re-building
>> lm.yx<-lm(y~x1)
>>
>> # change names of dum2 to match names of coefficients of lm.yx
>> names(dum2)<-names(coefficients(lm.yx))
>> predict(lm.yx, newdata=dum2)
>>
>> Is there another way that involves changing the lm object rather than
>> changing the prediction data.frame?
>
> It seems to me that changing the lm object post hoc is wasted energy. You
> won't have any problem if you coordinate the names of the prediction data
> frame with those of the input data, which should ideally be a data frame
> itself. The most hassle-free strategy is to input a data frame into lm()
> that contains the variables you want to model, and generate a set of
> prediction points with a data frame whose columns have the same names as
> those on the RHS of the model formula.
>
> HTH,
> Dennis
>>
>> Thanks,
>> Anirban
>>
>> --
>> Anirban Mukherjee | Assistant Professor, Marketing
>> LKCSB, Singapore Management University
>> 5056 School of Business, 50 Stamford Road
>> Singapore 178899 | +65-6828-1932
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>