[R] linear model benchmarking
ivo welch
ivowel at gmail.com
Mon Apr 23 05:04:06 CEST 2012
I cleaned up my old benchmarking code, and added checks for missing
data, to compare various ways of obtaining OLS regression coefficients.
I thought I would share this with others. The long and short of it is
that I would recommend
ols.crossprod = function (y, x) {
    x <- as.matrix(x)
    ok <- !is.na(y) & !is.na(rowSums(x))   ## keep only complete rows
    y <- y[ok]; x <- x[ok, , drop=FALSE]
    x <- cbind(1, x)                       ## prepend the intercept column
    XtX <- crossprod(x)                    ## X'X
    Xty <- crossprod(x, y)                 ## X'y
    solve(XtX, Xty)                        ## normal equations: X'X b = X'y
}
for fast and stable coefficients. (Yes, stable in double precision,
even though not as stable as lm(). It works just fine with X variables
that have >99.99% correlation with as few as 100 observations. If your
situation is worse than this, you probably have an error in your data,
or you are looking for the Higgs boson.)
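For instance, here is a quick check (a minimal sketch on made-up data,
not part of the benchmark harness below) with missing values and two
>99.99% correlated columns; the two output columns should agree to many
digits:

## sanity check: ols.crossprod vs. lm() on messy data
set.seed(1)
n <- 100
x <- matrix(rnorm(3*n), n, 3)
x[,3] <- x[,2] + rnorm(n, sd=1e-3)   ## two nearly collinear columns
y <- drop(1 + x %*% c(2, -1, 0.5) + rnorm(n))
y[5] <- NA; x[7,1] <- NA             ## missing data gets filtered out
cbind(crossprod = drop(ols.crossprod(y, x)), lm = coef(lm(y ~ x)))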
I added the code below. Feel free to ignore.
/iaw
################################################################
###
### Code to test how fast OLS coefficients can be obtained with
### alternative algorithms, including tests to exclude missing
### observations where necessary.
###
### For a more informed article (and person), see Bates, "Least Squares
### Calculations in R," R News 4/1, 2004. The code here does not test his
### sparse-matrix examples, or his geMatrix/poMatrix examples.
###
### Basic Results: for the examples that I tried, typical relative time
### factors of the algorithms were about
###
###   lm    lmfit   solve   crossprod   cholesky   (special-case, 2 vars)
###   1.0   0.5     0.3     0.15        0.17       0.1
###
### There was no advantage to cholesky, so you may as well use the simpler
### crossprod.
###
### I was also interested in how the algorithms scale in N and K. Yes,
### there were some changes in the factors across algorithms, but the
### general pattern wasn't too different. For the cholesky decomposition,
### relative times were
###
###            N=1000   N=10000   N=100000   N=200000
###   K=1         1.0         7         80        160
###   K=10        2.5        26
###   K=50         16
###   K=100        43        70
###   K=200       140
###
### Some of this may well be swap/memory access, etc. Roughly speaking,
### doubling N takes twice as long (80 to 160 as N goes from 100,000 to
### 200,000), and scaling both N and K ten-fold takes about 25 times as
### long (1.0 to 26 from N=1000, K=1 to N=10000, K=10). (A stripped-down
### timing sketch follows this header block.)
###
### Of course, ols.crossprod and ols.cholesky are not as stable as ols.lm,
### but they are still amazingly stable, given the default double
### precision. Even with a correlation of 0.999999(!) between the two
### final columns, they still produce exactly the same result as ols.lm
### with 1000 observations. Frankly, the ill-conditioning worry is
### overblown for most real-world data. If you really have data THIS bad,
### you should already know it; and you probably just have some
### measurement error in your observations, so your regression is giving
### you garbage either way.
###
### If I were making the R core decisions, I would switch away from
### lm()'s default method and make it a special option. My guess is that
### it is status-quo bias that keeps the current method. Or, at the least,
### I would say loudly in the R docs that for common use, here is a much
### faster method...
###
################################################################
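Before the full harness, here is a stripped-down sketch of the kind of
N/K timing loop behind the scaling table above. (It times ols.crossprod
from the top of this message rather than the cholesky version, and the
grid, repetitions, and machine will all change the numbers.)

## stripped-down N/K scaling sketch; numbers are machine-dependent
scaling.sketch <- function (Ns=c(1000, 10000), Ks=c(1, 10), reps=20) {
    for (N in Ns) for (K in Ks) {
        x <- matrix(rnorm(N*K), N, K)
        y <- rnorm(N)
        secs <- system.time(for (r in 1:reps) ols.crossprod(y, x))[["elapsed"]]
        cat("N =", N, " K =", K, ":",
            sprintf("%.2f", 1000*secs/reps), "msecs/call\n")
    }
}
scaling.sketch()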
MC <- 100    ## number of Monte Carlo repetitions
N  <- 1000   ## observations per regression
K  <- 10     ## regressors, excluding the intercept
SD <- 1e-3   ## noise sd for the nearly collinear final x column (K>1 only)
ols <- list(
    ols.lm = function (y, x) { coef(lm(y ~ x)) },

    ols.lmfit = function (y, x) {
        x <- as.matrix(x)
        ok <- !is.na(y) & !is.na(rowSums(x))
        y <- y[ok]; x <- x[ok, , drop=FALSE]
        x <- cbind(1, x)
        lm.fit(x, y)$coefficients
    },

    ## naive textbook version: forms the explicit inverse of X'X
    ols.solve = function (y, x) {
        x <- as.matrix(x)
        ok <- !is.na(y) & !is.na(rowSums(x))
        y <- y[ok]; x <- x[ok, , drop=FALSE]
        x <- cbind(1, x)
        xty <- t(x) %*% y
        xxi <- solve(t(x) %*% x)
        as.vector(xxi %*% xty)
    },

    ## solves the normal equations X'X b = X'y without an explicit inverse
    ols.crossprod = function (y, x) {
        x <- as.matrix(x)
        ok <- !is.na(y) & !is.na(rowSums(x))
        y <- y[ok]; x <- x[ok, , drop=FALSE]
        x <- cbind(1, x)
        XtX <- crossprod(x)
        Xty <- crossprod(x, y)
        solve(XtX, Xty)
    },

    ## X'X = R'R with R upper triangular: solve R'z = X'y, then R b = z
    ols.cholesky = function (y, x) {
        x <- as.matrix(x)
        ok <- !is.na(y) & !is.na(rowSums(x))
        y <- y[ok]; x <- x[ok, , drop=FALSE]
        x <- cbind(1, x)
        ch <- chol(crossprod(x))
        backsolve(ch, forwardsolve(ch, crossprod(x, y),
                                   upper=TRUE, trans=TRUE))
    }
)
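In case the two triangular solves in ols.cholesky look opaque: chol()
returns the upper-triangular factor R with X'X = R'R, so solving
R'z = X'y (forwardsolve) and then R b = z (backsolve) gives the same b
as solving the normal equations directly. A standalone check on made-up
data:

## check: the two triangular solves equal the direct solve
xx <- cbind(1, matrix(rnorm(500), 100, 5))
yy <- rnorm(100)
ch <- chol(crossprod(xx))   ## ch is R, upper triangular, with X'X = R'R
b.chol <- backsolve(ch, forwardsolve(ch, crossprod(xx, yy),
                                     upper=TRUE, trans=TRUE))
b.direct <- solve(crossprod(xx), crossprod(xx, yy))
max(abs(b.chol - b.direct))   ## should be tiny, around 1e-15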
set.seed(0)
y <- matrix(rnorm(N*MC), N, MC)          ## one y vector per MC draw
x <- array(rnorm(MC*K*N), c(N, K, MC))   ## one N-by-K x matrix per MC draw
cat("N=", N, "K=", K, " (MC=", MC, ")")
if (K > 1) {
    ## make the last two columns of x nearly collinear
    sum.cor <- 0
    for (mc in 1:MC) {
        x[,K,mc] <- x[,K-1,mc] + rnorm(N, sd=SD)
        sum.cor <- sum.cor + cor(x[,K,mc], x[,K-1,mc], use="pair")
    }
    options(digits=10)
    cat(" sd=", SD, " the bad corr = 1 +", sum.cor/MC - 1)
} else {
    ## special case for K=1: simple regression slope via cov/var
    ols$ols.xy.Kis1 <- function (y, x) {
        ok <- !is.na(y) & !is.na(x)
        y <- y[ok]; x <- x[ok]
        b <- cov(y, x)/var(x)
        c(mean(y) - b*mean(x), b)
    }
}
cat("\n")
## inject a couple of NA values so that the NA checks are exercised
x[1,1,1] <- NA
y[2,2] <- NA
options(digits=2)
time.factor.first <- first.checksum <- NA
for (ctr in seq_along(ols)) {
    invisible({gc(); gc(); gc()})   ## settle the garbage collector first
    coefsum <- 0.0
    v <- system.time(for (mc in 1:MC) coefsum <- coefsum +
                         ols[[ctr]](y[,mc], x[,,mc]))
    thiscoefsum <- sum(abs(coefsum))
    thistime <- sum(as.numeric(v)[1:2])/MC   ## user+sys seconds per regression
    if (is.na(time.factor.first)) time.factor.first <- thistime
    if (is.na(first.checksum)) first.checksum <- thiscoefsum
    cat(sprintf("%13.13s", names(ols[ctr])), " Checksum = ", thiscoefsum, "(",
        sprintf("%.8f", abs(thiscoefsum - first.checksum)/MC), ")",
        "Time =", sprintf("%.3f", 1000*thistime), "msecs. Factor =",
        thistime/time.factor.first, "\n")
}
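The checksum column flags disagreement only in aggregate; to eyeball
per-method agreement directly, something like this quick sketch (run
after the loop above, on one clean MC draw) works:

## max |coefficient difference| of each method vs. lm() on draw 3
ref <- as.vector(ols$ols.lm(y[,3], x[,,3]))
for (name in names(ols)) {
    est <- as.vector(ols[[name]](y[,3], x[,,3]))
    cat(sprintf("%13.13s", name), " max |diff vs lm| =",
        format(max(abs(est - ref)), digits=3), "\n")
}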