[R] Am failing on making lagged residual after regression

Tue Mar 9 09:28:24 CET 2004

If you have missing data in your data frame and want residuals for all 
observations, you need to use na.action=na.exclude, not the default 
na.omit.
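
A minimal sketch of the difference, using a made-up data frame (the names d, y and x are hypothetical and not taken from the poster's data). It also shows that the names on the residual vector are row labels, so indexing by position and indexing by name can give different elements once rows have been dropped:

```r
# Hypothetical toy data: rows 1 and 5 have a missing response.
d <- data.frame(y = c(NA, 2.1, 3.9, 6.2, NA, 9.8),
                x = c(1, 2, 3, 4, 5, 6))

m_omit    <- lm(y ~ x, data = d, na.action = na.omit)     # the usual default
m_exclude <- lm(y ~ x, data = d, na.action = na.exclude)

e_omit    <- residuals(m_omit)
e_exclude <- residuals(m_exclude)

length(e_omit)      # 4: incomplete rows are dropped from the result
length(e_exclude)   # 6: one entry per row, NA where the row was incomplete
names(e_omit)       # "2" "3" "4" "6": row labels, not positions

e_omit[4]           # 4th element, which belongs to row 6
e_omit["4"]         # the element for row 4
unname(e_exclude)   # the plain numbers, without the row labels
```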

As for lag, its description says

Description:

     Compute a lagged version of a time series, shifting the time base
     back by a given number of observations.

and you don't have a time series.  It works by shifting the time base for 
a time series, not by moving the contents of a vector.
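
To illustrate the distinction, here is a sketch with made-up numbers (e, e_lag1 and z are hypothetical names). It assumes the residuals came from a fit with na.action=na.exclude, so that they have one entry per row of the data frame; the lagged vector is then built by moving the contents by hand, which is one possible approach rather than the only one:

```r
# Hypothetical residuals for a 6-row data frame, NA for incomplete rows.
e <- c("1" = NA, "2" = -0.5, "3" = 0.25, "4" = 0.75, "5" = NA, "6" = -0.5)

# Move the contents: element i of e_lag1 is element i - 1 of e.
e_lag1 <- c(NA, unname(e)[-length(e)])
e_lag1[3]            # -0.5, the residual for row 2

# lag() on a time series leaves the values alone and shifts the time base.
z <- ts(c(10, 20, 30), start = 1)
tsp(z)               # 1 3 1: start, end, frequency
tsp(lag(z, -1))      # 2 4 1: same values, now dated one period later
as.numeric(lag(z, -1))   # 10 20 30
```

Note that for a ts object it is k = -1, not k = 1, that places the previous period's value at the current time.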

On Mon, 8 Mar 2004, Ajay Shah wrote:

> Folks,
> 
> I'm most confused in trying to do something that (I thought) ought to be
> mainstream and straightforward R. :-) Could you please help?
> 
> I am doing an ordinary linear regression. My goal is: After a
> regression, to make residuals, and make a new variable which is the
> lagged residuals (lagged by 1). I will use this variable in a 2nd
> stage regression (for an error-correcting model).
> 
> This sounds simple and reasonable, and should be right up R's alley,
> but I am just not able to do this. Can I please show you the steps
> which I'm trying and failing in?
> 
> I start with:
> 
> > m = lm(NNDA ~ NFA + NFA.x.d1 + NFA.x.d2 + IIP.n + CRR, D.f)
> > e = residuals(m)
> > print(e)
>           34           35           36           37           38           39 
>  -5073.24843  -4210.27886  -8218.01782  -1489.10583  -4426.11738 -11332.56052 
>   (lines deleted)
>           64           65           66           67           68           69 
>   8362.93776   7564.14324   2311.41208   7660.00638  -1271.04645 -10917.29418 
>   (lines deleted)
>          160          161          162          163          164          165 
>   3858.94591 -11783.04370 -21438.33646   1859.49628  -4988.82853 -25172.43241 
> 
> Here, the residuals only started at the 34th observation owing to
> missing data in my data frame. This is correct and sensible. The
> dataset is 167 observations, but 166 and 167 are also missing data and
> dropped.
> 
> I tried to use lag(e,1) to make a new vector and failed. I think I am
> just not understanding the R concept of lag(). In my notion of a
> lagged vector, I want a vector f where f[35] is e[34], i.e. is the
> first residual above of -5073.24843. This is just not what I get by
> saying lag(e,1) - I am just not understanding lag(). I would be very
> happy if someone could educate me on how to utilise lag().
> 
> Okay, I try to get my way in a different way:
> 
> > print(T)
> [1] 167
> > f = numeric(T)
> > f[1] = NA
> > f[2:T] = e[1:(T-1)]
> 
> This looks reasonable? I thought this should do the trick. I am
> hand-initialising a T-length vector with NA in the 1st elem, and I
> copy out the values of e[] from 1 till 166 into f[2:T]. I thought this
> should give me a lagged e. It doesn't --
> 
> > print(f)
>   [1]           NA  -5073.24843  -4210.27886  -8218.01782  -1489.10583
>   (lines deleted)
> [131]   1859.49628  -4988.82853 -25172.43241           NA           NA
>   (lines deleted)
> [166]           NA           NA
> 
> I thought "Okay, what seems to be happening is that the e[1] that I
> have is `actually' the e[34] of my thoughts". So I try:
> 
> > f=rep(NA, T)               # zap out f
> > f[35:T] = e[34:(T-1)]      # copy out useful stuff into 35..T
> > print(f)
>   [1]           NA           NA           NA           NA           NA
>   (lines deleted)
>  [31]           NA           NA           NA           NA   7660.00638
>  [36]  -1271.04645 -10917.29418 -11111.60144  -1597.98355  -1066.01901
>   (lines deleted)
> [131]   1859.49628  -4988.82853 -25172.43241           NA           NA
>   (lines deleted)
> [166]           NA           NA
> 
> This is wrong!!
> 
> Recall (from upstairs) that e[34] was -5073.24843. That value seems to
> have mysteriously vanished. Instead, the first non-NA in f - which is
> f[35] - is 7660.00638, which (incidentally) was e[67]. I just don't
> know how that value got here. And, the values in f[] seem to peter out
> at 133!  After 133, they are all NA until the end.
> 
> I guess I'm _just_ not understanding what is the animal that is
> returned by residuals(lm()). I know I am missing something basic,
> because lots of people must be doing what I am trying: I.e. to run a
> regression, extract a residual, lag it, and use it for a 2nd stage
> regression.
> 
> I know that the vector e (returned by residuals(lm())) is different
> from a simple vector, for when I say:
> 
> > print(f[35])
> [1] 7660.006
> > print(e[35])
>        68 
> -1271.046 
> 
> the two animals seem to be different. I don't understand e[35] - why
> is it not just a number - there seems to be some index tagging along?
> How do I get at the pure numbers of the residuals?
> 
> Thanks much,
> 
>        -ans.
> 
> 

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595