[R] Using predict.lm()

Thu Jun 17 16:25:30 CEST 2004

On Thu, 17 Jun 2004, Steven White wrote:

> Following the example in help(predict.lm):
> 
>      x <- rnorm(15)
>      y <- x + rnorm(15)
>      new <- data.frame(x = seq(-3, 3, 0.5))
>      predict(lm(y ~ x), new)
> 
> predicts the response elements corresponding to new$x as can be viewed by:
> 
>      plot(x,y)
>      lines(new$x,predict(lm(y ~ x), new))

Note that in that example the model is fitted to a variable named `x' and
new contains a variable named `x'.  You haven't copied that in what follows.

> I am trying to extend this fitting and prediction over a variety of factors as 
> follows:
> 
>      f<-rep(c("FIRST","SECOND"),each=15)
>      f<-as.factor(f)
>      x<-rep(rnorm(15),2)
>      y<-x+rnorm(length(x))
>      old<-data.frame(f=f,x=x,y=y)
>      new<-data.frame(f=rep(levels(f),each=length(seq(-4,4,0.2))),x=seq(-4,4,0.2))
> 
> ...where the frame new simply substitutes a different domain from that of 
> old. When I try to predict on the frame new using x & y, I get a response 
> whose length corresponds to that of new:
> 
>      predict(lm(y~x),new)
> 
> but when I use the same variables from within the frame old, 

That you have not done correctly: see ?lm.

> the frame new is ignored:

No, it is not ignored, but it does not contain a variable named `old$x' and 
your workspace does.  newdata is the first place predict() looks for 
variables, but not the only place.
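
A minimal sketch of that lookup, using a single-predictor version of the 
example.  The names fit_dollar and fit_data and the set.seed() call are 
illustrative additions, not part of the original post.  suppressWarnings() 
is there because current versions of R may warn about the row mismatch.

```r
set.seed(1)                              # only to make the sketch repeatable
old <- data.frame(x = rnorm(15))
old$y <- old$x + rnorm(15)
new <- data.frame(x = seq(-3, 3, 0.5))   # 13 rows

# Formula written with old$x: 'new' holds nothing called `old$x`, so the
# term is evaluated from the 'old' found in the workspace (15 values).
fit_dollar <- lm(old$y ~ old$x)
p_dollar <- suppressWarnings(predict(fit_dollar, new))

# Formula written with bare names plus data=: 'new' supplies `x`.
fit_data <- lm(y ~ x, data = old)
p_data <- predict(fit_data, new)

length(p_dollar)   # same length as old$x
length(p_data)     # same length as new$x
```

In the first case the "predictions" are just the fitted values at the 
original old$x.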

>      predict(lm(old$y~old$x),new)
> 
> ...results in a response the length of old$x (presumably predicting over the 
> values of old$x). Furthermore, this behavior also precludes using something 
> more useful, i.e.:
> 
>      predict(lm(old$y~old$f/(1+old$x)-1),new)
> 
> to return predictions over a number of factors over redefined domains. In my 
> case, I am attempting to do 2nd order polynomial fitting over noisy data 
> collected for a large number of factors (~85). The data were collected for 
> each factor at convenient (and therefore dissimilar) points within a common 
> domain, but I need to compare the responses of each factor at similar points 
> within the common domain.
> 
> I am obviously missing something here because I continue to be puzzled by the 
> result. I had thought (perhaps erroneously) that lm() would return a model 
> object that would permit prediction. 

Indeed it does.

> Indeed:
> 
>      lm(old$y~old$f/(1+old$x)-1)
> 
> ...results in:
> 
> Call:
> lm(formula = old$y ~ old$f/(1 + old$x) - 1)
> 
> Coefficients:
>        old$fFIRST        old$fSECOND   old$fFIRST:old$x  old$fSECOND:old$x
>          -0.08489           -0.05839            1.15351            0.72981
> 
> which clearly provides a model fit for each factor, and identifies the factor 
> from which each model coefficient was extracted, so lm() does provide the 
> capability to predict over the factors. It seems however (as nearly as I can 
> tell), that predict simply ignores the frame new altogether, failing even to 
> provide a warning.

Nope.  You just haven't set new to match your fit.

> Is this the intended behavior? Have I missed something very simple or have a 
> fundamental misunderstanding of how this should work?

Yes, yes.  You should be using

	lm(y ~ f/(1+x)-1, data=old)

etc, although in your example you could omit data=old.  That is in all 
good books on the S language ....
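
For completeness, a sketch of the whole fit-and-predict sequence written 
that way.  The use of expand.grid() to build new, the names grid and fit, 
and the set.seed() call are my assumptions for illustration; the formula 
and the data layout follow the post.

```r
set.seed(42)                             # only to make the sketch repeatable
f <- factor(rep(c("FIRST", "SECOND"), each = 15))
x <- rep(rnorm(15), 2)
y <- x + rnorm(length(x))
old <- data.frame(f = f, x = x, y = y)

# Common grid of x values, repeated once per factor level.
grid <- seq(-4, 4, 0.2)
new <- expand.grid(x = grid, f = levels(old$f))
new$f <- factor(new$f, levels = levels(old$f))   # same levels as the fit

# Bare variable names in the formula, so that predict() can find
# `f` and `x` in 'new'.
fit <- lm(y ~ f/(1 + x) - 1, data = old)
new$pred <- predict(fit, new)

coef(fit)          # one intercept and one slope per level of f
head(new)
```

new$pred then has one prediction per row of new, so the levels of f can be 
compared at the same x values.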

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595