[R] linear model benchmarking
ivo welch
ivowel at gmail.com
Mon Apr 23 05:04:06 CEST 2012
I cleaned up my old benchmarking code, and added checks for missing
data, to compare various ways of obtaining OLS regression coefficients.
I thought I would share this with others. The long and short of it is
that I would recommend
ols.crossprod = function (y, x) {
    x <- as.matrix(x)
    ok <- !is.na(y) & !is.na(rowSums(x))   ## keep only complete rows
    y <- y[ok]; x <- x[ok, , drop=FALSE]
    x <- cbind(1, x)                       ## prepend the intercept column
    XtX <- crossprod(x)                    ## X'X
    Xty <- crossprod(x, y)                 ## X'y
    solve(XtX, Xty)                        ## normal equations: X'X b = X'y
}
for fast and stable coefficients. (Yes, stable in double precision,
even though not as stable as lm(). It works just fine with X variables
that have >99.99% correlation with as few as 100 observations. If your
situation is worse than this, you probably have an error in your data,
or you are looking for the Higgs boson.)
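For instance, here is a quick check (a minimal sketch on made-up data,
not part of the benchmark harness below) with missing values and two
>99.99% correlated columns; the two output columns should agree to many
digits:

## sanity check: ols.crossprod vs. lm() on messy data
set.seed(1)
n <- 100
x <- matrix(rnorm(3*n), n, 3)
x[,3] <- x[,2] + rnorm(n, sd=1e-3)   ## two nearly collinear columns
y <- drop(1 + x %*% c(2, -1, 0.5) + rnorm(n))
y[5] <- NA; x[7,1] <- NA             ## missing data gets filtered out
cbind(crossprod = drop(ols.crossprod(y, x)), lm = coef(lm(y ~ x)))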
I added the code below. Feel free to ignore.
/iaw
################################################################
###
### Code to test how fast OLS coefficients can be obtained with
### alternative algorithms, including tests to exclude missing
### observations where necessary.
###
### For a more informed article (and person), see Bates, "Least Squares
### Calculations in R," R News 4/1, 2004. The code here does not test his
### sparse-matrix examples, or his geMatrix/poMatrix examples.
###
### Basic Results: for the examples that I tried, typical relative time
### factors of the algorithms were about
###
###   lm    lmfit   solve   crossprod   cholesky   (special-case, 2 vars)
###   1.0   0.5     0.3     0.15        0.17       0.1
###
### There was no advantage to cholesky, so you may as well use the simpler
### crossprod.
###
### I was also interested in how the algorithms scale in N and K. Yes,
### there were some changes in the factors across algorithms, but the
### general pattern wasn't too different. For the cholesky decomposition,
### relative times were
###
###            N=1000   N=10000   N=100000   N=200000
###   K=1         1.0         7         80        160
###   K=10        2.5        26
###   K=50         16
###   K=100        43        70
###   K=200       140
###
### Some of this may well be swap/memory access, etc. Roughly speaking,
### doubling N takes twice as long (80 to 160 as N goes from 100,000 to
### 200,000), and scaling both N and K ten-fold takes about 25 times as
### long (1.0 to 26 from N=1000, K=1 to N=10000, K=10). (A stripped-down
### timing sketch follows this header block.)
###
### Of course, ols.crossprod and ols.cholesky are not as stable as ols.lm,
### but they are still amazingly stable, given the default double
### precision. Even with a correlation of 0.999999(!) between the two
### final columns, they still produce exactly the same result as ols.lm
### with 1000 observations. Frankly, the ill-conditioning worry is
### overblown for most real-world data. If you really have data THIS bad,
### you should already know it; and you probably just have some
### measurement error in your observations, so your regression is giving
### you garbage either way.
###
### If I were making the R core decisions, I would switch away from
### lm()'s default method and make it a special option. My guess is that
### it is status-quo bias that keeps the current method. Or, at the least,
### I would say loudly in the R docs that for common use, here is a much
### faster method...
###
################################################################
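Before the full harness, here is a stripped-down sketch of the kind of
N/K timing loop behind the scaling table above. (It times ols.crossprod
from the top of this message rather than the cholesky version, and the
grid, repetitions, and machine will all change the numbers.)

## stripped-down N/K scaling sketch; numbers are machine-dependent
scaling.sketch <- function (Ns=c(1000, 10000), Ks=c(1, 10), reps=20) {
    for (N in Ns) for (K in Ks) {
        x <- matrix(rnorm(N*K), N, K)
        y <- rnorm(N)
        secs <- system.time(for (r in 1:reps) ols.crossprod(y, x))[["elapsed"]]
        cat("N =", N, " K =", K, ":",
            sprintf("%.2f", 1000*secs/reps), "msecs/call\n")
    }
}
scaling.sketch()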
MC <- 100    ## number of Monte Carlo repetitions
N  <- 1000   ## observations per regression
K  <- 10     ## regressors, excluding the intercept
SD <- 1e-3   ## noise sd for the nearly collinear final x column (K>1 only)
ols <- list(
    ols.lm = function (y, x) { coef(lm(y ~ x)) },

    ols.lmfit = function (y, x) {
        x <- as.matrix(x)
        ok <- !is.na(y) & !is.na(rowSums(x))
        y <- y[ok]; x <- x[ok, , drop=FALSE]
        x <- cbind(1, x)
        lm.fit(x, y)$coefficients
    },

    ## naive textbook version: forms the explicit inverse of X'X
    ols.solve = function (y, x) {
        x <- as.matrix(x)
        ok <- !is.na(y) & !is.na(rowSums(x))
        y <- y[ok]; x <- x[ok, , drop=FALSE]
        x <- cbind(1, x)
        xty <- t(x) %*% y
        xxi <- solve(t(x) %*% x)
        as.vector(xxi %*% xty)
    },

    ## solves the normal equations X'X b = X'y without an explicit inverse
    ols.crossprod = function (y, x) {
        x <- as.matrix(x)
        ok <- !is.na(y) & !is.na(rowSums(x))
        y <- y[ok]; x <- x[ok, , drop=FALSE]
        x <- cbind(1, x)
        XtX <- crossprod(x)
        Xty <- crossprod(x, y)
        solve(XtX, Xty)
    },

    ## X'X = R'R with R upper triangular: solve R'z = X'y, then R b = z
    ols.cholesky = function (y, x) {
        x <- as.matrix(x)
        ok <- !is.na(y) & !is.na(rowSums(x))
        y <- y[ok]; x <- x[ok, , drop=FALSE]
        x <- cbind(1, x)
        ch <- chol(crossprod(x))
        backsolve(ch, forwardsolve(ch, crossprod(x, y),
                                   upper=TRUE, trans=TRUE))
    }
)
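In case the two triangular solves in ols.cholesky look opaque: chol()
returns the upper-triangular factor R with X'X = R'R, so solving
R'z = X'y (forwardsolve) and then R b = z (backsolve) gives the same b
as solving the normal equations directly. A standalone check on made-up
data:

## check: the two triangular solves equal the direct solve
xx <- cbind(1, matrix(rnorm(500), 100, 5))
yy <- rnorm(100)
ch <- chol(crossprod(xx))   ## ch is R, upper triangular, with X'X = R'R
b.chol <- backsolve(ch, forwardsolve(ch, crossprod(xx, yy),
                                     upper=TRUE, trans=TRUE))
b.direct <- solve(crossprod(xx), crossprod(xx, yy))
max(abs(b.chol - b.direct))   ## should be tiny, around 1e-15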
set.seed(0)
y <- matrix(rnorm(N*MC), N, MC)          ## one y vector per MC draw
x <- array(rnorm(MC*K*N), c(N, K, MC))   ## one N-by-K x matrix per MC draw
cat("N=", N, "K=", K, " (MC=", MC, ")")
if (K > 1) {
    ## make the last two columns of x nearly collinear
    sum.cor <- 0
    for (mc in 1:MC) {
        x[,K,mc] <- x[,K-1,mc] + rnorm(N, sd=SD)
        sum.cor <- sum.cor + cor(x[,K,mc], x[,K-1,mc], use="pair")
    }
    options(digits=10)
    cat(" sd=", SD, " the bad corr = 1 +", sum.cor/MC - 1)
} else {
    ## special case for K=1: simple regression slope via cov/var
    ols$ols.xy.Kis1 <- function (y, x) {
        ok <- !is.na(y) & !is.na(x)
        y <- y[ok]; x <- x[ok]
        b <- cov(y, x)/var(x)
        c(mean(y) - b*mean(x), b)
    }
}
cat("\n")
## inject a couple of NA values so that the NA checks are exercised
x[1,1,1] <- NA
y[2,2] <- NA
options(digits=2)
time.factor.first <- first.checksum <- NA
for (ctr in seq_along(ols)) {
    invisible({gc(); gc(); gc()})   ## settle the garbage collector first
    coefsum <- 0.0
    v <- system.time(for (mc in 1:MC) coefsum <- coefsum +
                         ols[[ctr]](y[,mc], x[,,mc]))
    thiscoefsum <- sum(abs(coefsum))
    thistime <- sum(as.numeric(v)[1:2])/MC   ## user+sys seconds per regression
    if (is.na(time.factor.first)) time.factor.first <- thistime
    if (is.na(first.checksum)) first.checksum <- thiscoefsum
    cat(sprintf("%13.13s", names(ols[ctr])), " Checksum = ", thiscoefsum, "(",
        sprintf("%.8f", abs(thiscoefsum - first.checksum)/MC), ")",
        "Time =", sprintf("%.3f", 1000*thistime), "msecs. Factor =",
        thistime/time.factor.first, "\n")
}
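The checksum column flags disagreement only in aggregate; to eyeball
per-method agreement directly, something like this quick sketch (run
after the loop above, on one clean MC draw) works:

## max |coefficient difference| of each method vs. lm() on draw 3
ref <- as.vector(ols$ols.lm(y[,3], x[,,3]))
for (name in names(ols)) {
    est <- as.vector(ols[[name]](y[,3], x[,,3]))
    cat(sprintf("%13.13s", name), " max |diff vs lm| =",
        format(max(abs(est - ref)), digits=3), "\n")
}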