[R] Moving window regressions - how can I improve this code?
Douglas Bates
bates at stat.wisc.edu
Sat Apr 24 15:58:21 CEST 2004
Ajay Shah <ajayshah at mayin.org> writes:
> I wrote a function which does "moving window" regressions. E.g. if
> there are 100 observations and the window width is 50, then I first
> run the regression for observations 1..50, then for 2..51, and so on.
>
> I am extremely pleased with R in my experience with writing this,
> since I was able to pass the model as an argument into the function
> :-) Forgive me if I sound naive, but that's rocket science to me!!
>
> For a regression with K explanatory variables, I make a matrix with
> 2*K+2 columns, where I capture:
> K coefficients and K standard errors
> the residual sigma
> R^2
>
> My code is:
>
> # ------------------------------------------------------------
> movingWindowRegression <- function(data, T, width, model, K) {
> results = matrix(nrow=T, ncol=2*K+2)
> for (i in width:T) {
> details <- summary.lm(lm(as.formula(model), data[(i-width+1):i,]))
> n=1;
> for (j in 1:K) {
> results[i, n] = details$coefficients[j, 1]
> results[i, n+1] = details$coefficients[j, 2]
> n = n + 2
> }
> results[i, n] = details$sigma
> results[i, n+1] = details$r.squared
> }
> return(results)
> }
>
> # Simulate some data for a linear regression
> T = 20
> x = runif(T); y = 2 + 3*x + rnorm(T);
> D = data.frame(x, y)
>
> r = movingWindowRegression(D, T=T, width=10, model="y ~ x", K=2)
> print(r)
> # ------------------------------------------------------------
>
> I would be very happy if you could look at this and tell me how to do
> things better.
First, thanks for posting the code. It takes courage to send your
code to the list like this for public commentary. However, questions
like this provide instructive examples.
Some comments:
As Gabor pointed out, there is a generic function nrow that can be
applied to data frames.
As a matter of style, it is better to use
details <- summary(lm(model, data[<row range>,]))
That is, call the generic function "summary", not the specific name
of the method "summary.lm". It is redundant to use summary.lm(lm(...))
and this construction is not guaranteed to continue to be valid.
With regard to the looping, I would suggest using a list of summary
objects as the basic data structure within your function, then
extracting the pieces that you want from that list using lapply or sapply.
That is, I would start with
movingWindow <- function(formula, data, width, ...) {
nr = nrow(data)
width = as.integer(width)[1]
if (width <= 0 || width >= nr)
stop(paste("width must be in the range 1,...,", nr, sep=""))
nreg = nr - width
base = 0:(width - 1)
sumrys <- lapply(1:nreg,
function(st) {
summary(lm(formula, data[base + st,]))
})
sumrys
}
I changed the argument name "model" to "formula" and changed the
order of the arguments to the standard order used in R modeling
functions.
You may not want to use this form for very large data sets because
the raw summaries could be very large. However this is a place to
start.
The reason for creating a list is so you can use sapply and lapply to
extract results from the list. To get the coefficients and standard
errors from a summary, use the coef generic and the select columns from
the result. For example,
> data(women)
> sumrys = movingWindow(weight ~ height, women, 9)
> unlist(lapply(sumrys, function(sm) sm$sigma)) # sigma values
[1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228
> sapply(sumrys, "[[", "sigma") # same thing
[1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228
> sapply(sumrys, function(sm) coef(sm)[,1:2])
[,1] [,2] [,3] [,4] [,5]
[1,] -59.77777778 -67.12777778 -73.42222222 -81.97222222 -91.7777778
[2,] 3.00000000 3.11666667 3.21666667 3.35000000 3.5000000
[3,] 3.77647036 2.64466036 3.70537766 4.71400546 5.1522157
[4,] 0.06085806 0.04194352 0.05784947 0.07246601 0.0780042
[,6]
[1,] -106.12777778
[2,] 3.71666667
[3,] 6.60219186
[4,] 0.09846709
>
The last part shows how to get the coefficients estimates and their
standard errors as columns of a matrix. I think I would return the
coefficients and standard errors separately, as in
movingWindow <- function(formula, data, width, ...) {
nr = nrow(data)
width = as.integer(width)[1]
if (width <= 0 || width >= nr)
stop(paste("width must be in the range 1,...,", nr, sep=""))
nreg = nr - width
base = 0:(width - 1)
sumrys <- lapply(1:nreg,
function(st) {
summary(lm(formula, data[base + st,]))
})
list(coefficients = sapply(sumrys, function(sm) coef(sm)[,"Estimate"]),
Std.Error = sapply(sumrys, function(sm) coef(sm)[,"Std. Error"]),
sigma = sapply(sumrys, "[[", "sigma"),
r.squared = sapply(sumrys, "[[", "r.squared"))
}
> movingWindow(weight ~ height, women, width = 9)
$coefficients
[,1] [,2] [,3] [,4] [,5] [,6]
(Intercept) -59.77778 -67.127778 -73.422222 -81.97222 -91.77778 -106.127778
height 3.00000 3.116667 3.216667 3.35000 3.50000 3.716667
$Std.Error
[,1] [,2] [,3] [,4] [,5] [,6]
(Intercept) 3.77647036 2.64466036 3.70537766 4.71400546 5.1522157 6.60219186
height 0.06085806 0.04194352 0.05784947 0.07246601 0.0780042 0.09846709
$sigma
[1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228
$r.squared
[1] 0.9971276 0.9987338 0.9977411 0.9967352 0.9965351 0.9951107
The next enhancements come from realizing that you do not need
to call lm repeatedly. The lm function involves many steps working
with the formula and the data frame and optional arguments to
produce a model matrix and response vector. You only need to do
that once. After that you can call lm.fit on subsets of the rows.
If you arrange that the call to lm is based on the "matched call" to
your function then you can get the ability to work with standard
modeling arguments such as subset and na.action for free. This may
seem like an obscure point but there are big gains in having your
modeling function behave like all the other R modeling functions.
Thomas Lumley's article on "Standard non-standard evaluation in R"
(which I would say was available at developer.r-project.org except
that the developer.r-project.org site is still down) explains this
in more detail.
Furthermore, by doing the initial lm fit you find out how many
coefficients there are in the model and can do a better job of
checking for a sensible "width" argument.
There are subtleties in this version of movingWindow2 involving
manipulation of the arguments in the original call to lm but these
are explained in the manual page for lm. You do need to set the
class and the terms component in the result of lm.fit before you can
apply summary to it.
movingWindow2 <- function(formula, data, width, ...) {
mCall = match.call()
mCall$width = NULL
mCall[[1]] = as.name("lm")
mCall$x = mCall$y = TRUE
bigfit = eval(mCall, parent.frame())
ncoef = length(coef(bigfit))
nr = nrow(data)
width = as.integer(width)[1]
if (width <= ncoef || width >= nr)
stop(paste("width must be in the range ", ncoef + 1,
",...,", nr - 1, sep=""))
y = bigfit$y
x = bigfit$x
terms = bigfit$terms
base = 0:(width - 1)
sumrys <- lapply(1:(nr - width),
function(st) {
inds = base + st
fit = lm.fit(x[inds,], y[inds])
fit$terms = terms
class(fit) = "lm"
summary(fit)
})
list(coefficients = sapply(sumrys, function(sm) coef(sm)[,"Estimate"]),
Std.Error = sapply(sumrys, function(sm) coef(sm)[,"Std. Error"]),
sigma = sapply(sumrys, "[[", "sigma"),
r.squared = sapply(sumrys, "[[", "r.squared"))
}
> movingWindow2(weight ~ height, women, 9)
$coefficients
[,1] [,2] [,3] [,4] [,5] [,6]
(Intercept) -59.77778 -67.127778 -73.422222 -81.97222 -91.77778 -106.127778
height 3.00000 3.116667 3.216667 3.35000 3.50000 3.716667
$Std.Error
[,1] [,2] [,3] [,4] [,5] [,6]
(Intercept) 3.77647036 2.64466036 3.70537766 4.71400546 5.1522157 6.60219186
height 0.06085806 0.04194352 0.05784947 0.07246601 0.0780042 0.09846709
$sigma
[1] 0.4714045 0.3248931 0.4481000 0.5613193 0.6042180 0.7627228
$r.squared
[1] 0.9971276 0.9987338 0.9977411 0.9967352 0.9965351 0.9951107
The use of lapply and sapply on lists is a style of programming
called "functional programming". The functional programming
facilities in the S language are not as widely recognized and
appreciated as they should be. Phil Spector's book "An Introduction
to S and S-PLUS" is one place where these are discussed and
illustrated.
P.S. If building a list of summaries is taking too much memory then
replace summary(fit) by summary(fit)[c("coefficients", "sigma", "r.squared")]
More information about the R-help
mailing list