[R] finding a faster way to run lm on rows of predictor matrix

Joshua Wiley jwiley.psych at gmail.com
Fri Jul 29 18:36:41 CEST 2011


Hi,

If you are doing repeated fits in this specialized case, you can
benefit by skipping some of the fancy overhead and going straight to
lm.fit.

## what you had
data.residuals <- t(apply(predictors, 1, function(x)
    lm(regress.y ~ -1 + as.vector(x))$residuals))

## passing the matrices directly to lm.fit (no need to augment the X
## matrix because you do not want the intercept anyway)
data.residuals2 <- t(sapply(1:6000, function(i)
    lm.fit(x = t(predictors[i, , drop = FALSE]), y = regress.y)$residuals))

On my little laptop:

> system.time(data.residuals <- t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals))))
   user  system elapsed
  42.84    0.11   45.01
> system.time(data.residuals2 <- t(sapply(1:6000, function(i) lm.fit(x = t(predictors[i, , drop = FALSE]), y = regress.y)$residuals)))
   user  system elapsed
   3.76    0.00    3.82
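
Since each fit here is a one-predictor regression through the origin, you can even skip lm.fit entirely: the OLS coefficient for y ~ -1 + x is sum(x*y)/sum(x^2), so all 6000 coefficients (and hence residual rows) can be computed with a couple of matrix operations. A minimal sketch (the name data.residuals3 is mine, and I regenerate the sample data so the snippet is self-contained):

```r
## closed-form fit of y ~ -1 + x for every row of the predictor matrix
set.seed(1)
regress.y  <- rnorm(150)
predictors <- matrix(rnorm(6000 * 150), nrow = 6000, ncol = 150)

## beta[i] = sum(x_i * y) / sum(x_i^2), computed for all rows at once
beta <- (predictors %*% regress.y) / rowSums(predictors^2)

## residual row i = y - beta[i] * x_i; the length-6000 beta vector is
## recycled down the columns of the 6000 x 150 matrix, matching row i
data.residuals3 <- matrix(regress.y, nrow = nrow(predictors),
                          ncol = ncol(predictors), byrow = TRUE) -
                   as.vector(beta) * predictors

## sanity check against lm() on the first row
r1 <- lm(regress.y ~ -1 + predictors[1, ])$residuals
stopifnot(all.equal(as.vector(data.residuals3[1, ]), as.vector(r1),
                    check.attributes = FALSE))
```

This avoids the per-fit function-call overhead altogether, so it should be faster still than the lm.fit loop, at the cost of being specific to the no-intercept, single-predictor case.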

If those timing ratios hold on your system, you should be looking at
around 1.2 seconds per run, which suggests the full set of 5,000 runs
will finish in roughly 2 hours.  Note that this is the sort of task
that can be readily parallelized, so if you are on a multicore
machine, you could take advantage of that without too much trouble.
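
As a rough sketch of what that parallelization could look like with the base parallel package (mclapply forks, so it only actually runs in parallel on Unix-alikes; on Windows it must fall back to mc.cores = 1):

```r
## parallelizing the lm.fit() loop over rows with parallel::mclapply
library(parallel)

set.seed(1)
regress.y  <- rnorm(150)
predictors <- matrix(rnorm(6000 * 150), nrow = 6000, ncol = 150)

## forking is unavailable on Windows, so run serially there
n.cores <- if (.Platform$OS.type == "unix") 2L else 1L

res.list <- mclapply(seq_len(nrow(predictors)), function(i)
    lm.fit(x = t(predictors[i, , drop = FALSE]), y = regress.y)$residuals,
    mc.cores = n.cores)

## reassemble into the same 6000 x 150 layout as the sapply version
data.residuals.par <- do.call(rbind, res.list)
```

With 5,000 independent repetitions of the whole task, you could equally well parallelize over repetitions instead of over rows, whichever gives you fewer, larger chunks of work per core.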

Cheers,

Josh

On Fri, Jul 29, 2011 at 8:30 AM,  <cypark at princeton.edu> wrote:
> Hi, everyone.
> I need to run lm with the same response vector but with varying predictor vectors. (i.e. 1 response vector on each individual 6,000 predictor vectors)
> After looking through the R archives, I found roughly 3 methods that have been suggested.
> Unfortunately, I need to run this task multiple times (~5,000 times) and would like to find a faster way than the existing methods.
> All three methods I have below take 13~20 seconds on my machine, timed with system.time.
>
> The opposite direction, 6,000 response vectors against 1 predictor vector, is supported directly by lm and runs really fast, ~0.5 seconds.
> It performs essentially the same number of lm fits, so I was wondering if there was a faster way, before I try to code this up in C++.
>
> thanks!!
>
> ## sample data ###
> regress.y = rnorm(150)
> predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
>
> ## method 1 ##
> data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals)))
>
> user  system elapsed
>  15.076   0.048  15.128
>
> ## method 2 ##
> data.residuals = matrix(rep(0, nrow(predictors) * ncol(predictors)), nrow=nrow(predictors), ncol=ncol(predictors) )
>
> for( i in 1:nrow(predictors)){
>    pred = as.vector(predictors[i,])
>    data.residuals[i, ] = lm(regress.y ~ -1 + pred )$residuals
> }
>
>  user  system elapsed
>  13.841   0.012  13.854
>
> ## method 3 ##
> library(nlme)
>
> all.data <- data.frame( y=rep(regress.y, nrow(predictors)), x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) )
> all.fits <- lmList( y ~ x | g, data=all.data)
> data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors), ncol=ncol(predictors))
>
> user  system elapsed
>  36.407   0.484  36.892
>
>
> ## the opposite direction, supported by lm ###
> lm(t(predictors) ~ -1 + regress.y)$residuals
>
>  user  system elapsed
>  0.500   0.120   0.613
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
https://joshuawiley.com/
