[R] Operating on windows of data

Wed Mar 24 03:56:46 CET 2004

It appears that I owe Martin Maechler <maechler at stat.math.ethz.ch>
an apology for not realising the importance of the context for what
I quoted.  I apologise.

	but *please* note again the code snippet we were talking about :

	    > dat <- sapply( seq(T-width), function(i) {
	    >     model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf, A, 
	    >                 i:(i+width-1))
	    >     details <- summary.lm(model)
	    >     tmp <- coefficients(model)
	    >     c( USD = tmp[2], JPY = tmp[3], DEM = tmp[4], 
	    >            R2 = details$r.squared, RMSE = details$sigma )
	    > } )
	    > dat <- as.data.frame(t(dat))
	    > attach(dat)

	which is really an example where sapply() rather obfuscates than
	clarifies.

It's not clear to me that the choice of sapply() -vs- 'for' really has
anything to do with it here.  Hmm, maybe it does.  Looking at this code,
I can see at a glance that
    - dat will be a matrix
    - it will have columns 1:(T-width)
    - it will have rows USD, JPY, DEM, R2, RMSE
    - each column reflects one linear model
and I don't have to decode a lot of indexed assignment statements to
figure this out.
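
A toy case, nothing to do with the regression itself, shows the shape
rule I am relying on: when the function returns a named vector of fixed
length, sapply() gives back a matrix with those names as rows and one
column per element of the input.

```r
# Each call returns a named numeric vector of length 2.
m <- sapply(1:3, function(i) c(a = i, b = i^2))

dim(m)        # rows come from the names (a, b), columns from 1:3
rownames(m)
```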

The first way to improve clarity would be to use keyword parameters on
the call to lm, e.g., lm(..., data = A, subset = i:(i+width-1)).

The second way to improve clarity would be to use character indices on
tmp rather than integer indices:

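	coef <- coefficients(model)
	# [[ ]] drops the coefficient's own name, so the elements come out
	# named USD, JPY, DEM rather than USD.dlusdchf and so on.
	c(USD = coef[["dlusdchf"]],
	  JPY = coef[["dljpychf"]],
	  DEM = coef[["dldemchf"]],
	  R2  = details$r.squared,
	  RMSE= details$sigma)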
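
One detail is worth checking when picking coefficients out by name:
single-bracket indexing keeps the element's own name, and c() then
pastes it onto the new one.  The sketch below uses a made-up vector v
as a stand-in for coefficients(model); the values are invented.

```r
# v is a hypothetical stand-in for coefficients(model).
v <- c("(Intercept)" = 0.1, dlusdchf = 0.5)

single <- c(USD = v["dlusdchf"])    # keeps the inner name
double <- c(USD = v[["dlusdchf"]])  # drops the inner name

names(single)
names(double)
```

If the result is meant to have an element called exactly USD, the
double-bracket form (or unname()) is the one to use.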

Hmm.  My "first" and "second" ways are both the same: use names rather
than position.  There is one more clarity improvement to recommend, and
it has nothing to do with using or avoiding sapply(), at least not
directly.

    # dfapply(X, FUN, ...) is like sapply() but
    # expects FUN to return c(x1=...,xn=...) vectors which it
    # turns into rows of the data frame that it returns.

    dfapply <- function (...) as.data.frame(t(sapply(...)))

    # Make "dat" a data frame with columns USD, JPY, DEM, R2, RMSE
    # and rows 1:(T-width), the ith row extracted from a linear
    # regression on cases i:(i+width-1).

    dat <- dfapply(seq(T-width), function (i) {
        model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
                    data = A, subset = i:(i+width-1))
        s <- summary.lm(model)
        v <- coefficients(model)
        # [[ ]] drops the coefficient's own name, so the columns are
        # named USD, JPY, DEM rather than USD.dlusdchf and so on.
        c(USD = v[["dlusdchf"]], JPY = v[["dljpychf"]], DEM = v[["dldemchf"]],
          R2 = s$r.squared, RMSE = s$sigma)
    })
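
To try this out without the original data, here is a self-contained
sketch on simulated numbers.  The data frame A is invented (only its
column names follow the thread), n plays the role of T so that TRUE's
abbreviation is not masked, and seq_len() replaces seq() so that an
empty range really is empty.

```r
set.seed(1)
n <- 60       # number of cases; stands in for T in the thread
width <- 20   # window length

# Simulated stand-in for A; the coefficients here are arbitrary.
A <- data.frame(dlusdchf = rnorm(n), dljpychf = rnorm(n), dldemchf = rnorm(n))
A$dlinrchf <- 0.8 * A$dlusdchf + 0.1 * A$dljpychf + 0.1 * A$dldemchf +
              rnorm(n, sd = 0.05)

# Like sapply(), but each result vector becomes a row of a data frame.
dfapply <- function(...) as.data.frame(t(sapply(...)))

dat <- dfapply(seq_len(n - width), function(i) {
    model <- lm(dlinrchf ~ dlusdchf + dljpychf + dldemchf,
                data = A, subset = i:(i + width - 1))
    s <- summary(model)
    v <- coefficients(model)
    c(USD = v[["dlusdchf"]], JPY = v[["dljpychf"]], DEM = v[["dldemchf"]],
      R2 = s$r.squared, RMSE = s$sigma)
})

dim(dat)     # one row per window, five columns
names(dat)
```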

Now here's where using sapply() instead of 'for' does pay off, even here.
We ask the question "where is 'i' used?"  Because we're *not* using i in
any visible index calculations, there is only one place that 'i' is used,
and that's in the subset= argument of the lm() call.

That prompts the question "is there any way to exploit the fact that the
rest of the linear model is the same?"  Depending on the relative sizes
of A and T-width, there may well be, and Statistical Models in S explains,
if memory serves me, how to do this kind of thing.  But without the fact
that i is only used in one place, it might not be as obvious that it was
worth thinking about.
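
One possible way to exploit it, offered as a sketch and not as what the
book describes: build the model frame and design matrix once, then hand
row windows of them to lm.fit(), which skips the formula handling that
lm() repeats on every call.  The data are simulated and the helper
fit_window is my own name.  Each window is still fitted from scratch, so
the saving is only the per-call overhead.

```r
set.seed(2)
n <- 60
width <- 20

# Simulated stand-in for A; only the column names follow the thread.
A <- data.frame(dlusdchf = rnorm(n), dljpychf = rnorm(n), dldemchf = rnorm(n))
A$dlinrchf <- 0.8 * A$dlusdchf + 0.1 * A$dljpychf + 0.1 * A$dldemchf +
              rnorm(n, sd = 0.05)

f  <- dlinrchf ~ dlusdchf + dljpychf + dldemchf
mf <- model.frame(f, data = A)   # built once for all windows
X  <- model.matrix(f, mf)        # design matrix, including the intercept
y  <- model.response(mf)

# Hypothetical helper: coefficients for the window starting at case i.
fit_window <- function(i) {
    rows <- i:(i + width - 1)
    lm.fit(X[rows, , drop = FALSE], y[rows])$coefficients
}

fast <- t(sapply(seq_len(n - width), fit_window))

# Cross-check the first window against the ordinary lm() call.
slow_first <- coefficients(lm(f, data = A, subset = 1:width))
all.equal(fast[1, ], slow_first)
```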