[R] finding a faster way to run lm on rows of predictor matrix

Dennis Murphy djmuser at gmail.com
Sat Jul 30 02:37:27 CEST 2011


Hi:

The reason you can fit N linear models to the same predictor matrix
quickly is that the model matrix X is fixed, so the multiple Y's can be
stored as a matrix. Consequently, all N regressions can be run in one
shot with a single least squares decomposition, and lm() implements this
efficiently by allowing the response to be a matrix.
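For illustration, the multi-response form looks like this (a minimal
sketch with made-up dimensions, not code from the original thread):

```r
set.seed(42)
X <- matrix(rnorm(150 * 3), nrow = 150, ncol = 3)    # one fixed model matrix
Y <- matrix(rnorm(150 * 6000), nrow = 150)           # 6000 responses, one per column
fit <- lm(Y ~ X)            # all 6000 fits share a single QR decomposition
dim(coef(fit))              # 4 x 6000: intercept + 3 slopes per response column
dim(residuals(fit))         # 150 x 6000: one residual column per response
```

Because the QR decomposition of the model matrix is computed once and
reused for every response column, this is far cheaper than 6000 calls
to lm().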

Your intention is to keep Y fixed and change the model matrix each
time. With 6000 potential predictors, this means 6000 separate
regression analyses, or to put it another way, 6000 iterations of the
least squares algorithm. It's unrealistic to expect the same timing
performance because the computing tasks are entirely different in
nature. As Josh showed, you can improve timing performance by
stripping down the computing task to its bare essentials. It might
also be worthwhile to peruse the article by Douglas Bates on least
squares calculations in R (start on p. 17) for some useful tips:
http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf
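In this particular case the stripping-down can go further: each fit is a
no-intercept simple regression, whose least squares solution has the
closed form beta_i = sum(x_i * y) / sum(x_i^2), and that can be
vectorized across all 6,000 rows at once. A minimal sketch (my own code,
reusing the poster's variable names):

```r
set.seed(1)
regress.y  <- rnorm(150)
predictors <- matrix(rnorm(6000 * 150), nrow = 6000, ncol = 150)

# Coefficients for all 6000 no-intercept simple regressions at once:
# beta_i = <x_i, y> / <x_i, x_i>
beta <- as.vector(predictors %*% regress.y) / rowSums(predictors^2)

# residuals[i, ] = y - beta_i * x_i; beta recycles down the columns,
# so each row i is scaled by beta[i]
data.residuals <- matrix(regress.y, nrow = nrow(predictors),
                         ncol = ncol(predictors), byrow = TRUE) -
                  predictors * beta
```

This replaces 6,000 calls to lm() with two matrix operations, and agrees
with lm(regress.y ~ -1 + x) up to floating point error.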

And for God's sake, don't run regression through the origin by
default, even if it's 'scientifically justifiable'. If you don't know
why, you need to do some reading. The topic has been hashed over on
this list several times - check the archives or do a Google search.
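One concrete reason, as a small illustration (made-up data, not from
the thread): summary.lm() computes R-squared for a no-intercept model
against the uncentered total sum of squares, so the reported R-squared
can be wildly inflated when the response is far from zero.

```r
set.seed(7)
x <- rnorm(50, mean = 10)        # predictor far from zero
y <- 5 + 0.1 * x + rnorm(50)     # weak relationship, large intercept

r2_int    <- summary(lm(y ~ x))$r.squared      # centered total SS
r2_origin <- summary(lm(y ~ x - 1))$r.squared  # uncentered total SS: inflated
```

Here r2_origin is close to 1 even though x explains almost none of the
variation in y, because forcing the fit through the origin credits the
model with explaining the mean level of y.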

HTH,
Dennis

On Fri, Jul 29, 2011 at 8:30 AM,  <cypark at princeton.edu> wrote:
> Hi, everyone.
> I need to run lm with the same response vector but with varying predictor vectors (i.e., one response vector against each of 6,000 individual predictor vectors).
> After looking through the R archives, I found roughly three methods that have been suggested.
> Unfortunately, I need to run this task multiple times (~5,000 times) and would like to find a faster way than the existing methods.
> All three methods I have below run in 13-20 seconds on my machine, timed with system.time.
>
> The opposite direction, 6,000 response vectors and one predictor vector, is supported by lm and runs really fast, ~0.5 seconds.
> They perform pretty much the same number of lm fits, so I was wondering if there is a faster way before I try to code this up in C++.
>
> thanks!!
>
> ## sample data ###
> regress.y = rnorm(150)
> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
>
> ## method 1 ##
> data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals)))
>
> user  system elapsed
>  15.076   0.048  15.128
>
> ## method 2 ##
> data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)), nrow=nrow(predictors), ncol=ncol(predictors) )
>
> for( i in 1:nrow(predictors)){
>    pred = as.vector(predictors[i,])
>    data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals
> }
>
>  user  system elapsed
>  13.841   0.012  13.854
>
> ## method 3 ##
> library(nlme)
>
> all.data <- data.frame( y=rep(regress.y, nrow(predictors)), x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) )
> all.fits <- lmList( y ~ x | g, data=all.data)
> data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors), ncol=ncol(predictors))
>
> user  system elapsed
>  36.407   0.484  36.892
>
>
> ## the opposite direction, supported by lm ###
> lm(t(predictors) ~ -1 + regress.y)$residuals
>
>  user  system elapsed
>  0.500   0.120   0.613
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
