[R-sig-hpc] Parallel linear model

Martin Morgan mtmorgan at fhcrc.org
Wed Aug 22 23:21:31 CEST 2012


On 08/22/2012 12:47 AM, Patrik Waldmann wrote:
> Hello,
>
>
> I wonder if someone has experience with efficient ways of implicit parallel execution of (repeated) linear models (as in the non-parallel example below)? Any suggestions on which way to go?
>
> Patrik Waldmann
>
> pval<-c(1:n)
> for (i in 1:n){
> mod <- lm(y ~ x[,i])
> pval[i] <- summary(mod)$coefficients[2,4]
> }

As a different tack, the design matrix has the same structure across all 
regressions (an intercept plus one column of x), and if your data are 
consistently structured it may pay to re-calculate only the fit. Here's a 
loosely-tested version that takes a full lm() fit as a template and 
overwrites its components with the lm.fit() result for an individual 
column:

looselm <- function(y, xi, tmpl)
{
     x <- cbind(`(Intercept)`= 1, xi=xi)
     z <- lm.fit(x, y)
     tmpl[names(z)] <- z
     tmpl
}

This is used in f2 and f3 (f0 and f1 are the corresponding plain lm() 
versions, serial and parallel)

library(parallel)  # provides mclapply(), used in f1 and f3

f0 <- function(x, y)
     lapply(seq_len(ncol(x)),
            function(i, x, y) summary(lm(y~x[,i]))$coefficients[2, 4],
            x, y)

f1 <- function(x, y, mc.cores=8L)
     mclapply(seq_len(ncol(x)),
              function(i, x, y) summary(lm(y~x[,i]))$coefficients[2, 4],
              x, y, mc.cores=mc.cores)

f2 <- function(x, y) {
     tmpl <- lm(y~x[,1])
     lapply(seq_len(ncol(x)),
            function(i, x, y, tmpl)  {
                summary(looselm(y, x[,i], tmpl))$coefficients[2, 4]
            }, x, y, tmpl)
}

f3 <- function(x, y, mc.cores=8L) {
     tmpl <- lm(y~x[,1])
     mclapply(seq_len(ncol(x)),
              function(i, x, y, tmpl)  {
                  summary(looselm(y, x[,i], tmpl))$coefficients[2, 4]
              }, x, y, tmpl, mc.cores=mc.cores)
}

with timings (for x of dimension 1000 x 1000)

 > system.time(ans0 <- f0(x, y))
    user  system elapsed
  23.865   1.160  25.120
 > system.time(ans1 <- f1(x, y, 8L))
    user  system elapsed
  31.902   6.705   6.708
 > system.time(ans2 <- f2(x, y))
    user  system elapsed
   5.285   0.296   5.596
 > system.time(ans3 <- f3(x, y, 8L))
    user  system elapsed
  10.256   4.092   2.322

and

 > identical(ans0, ans1)
[1] TRUE
 > identical(ans0, ans2)
[1] TRUE
 > identical(ans0, ans3)
[1] TRUE
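
The post does not show how x and y were generated. A minimal sketch for 
checking the template approach on a small problem, assuming simulated 
Gaussian data, might look like the following. The dimensions, the seed and 
the names pvals_full and pvals_tmpl are illustrative assumptions, not part 
of the original timings.

```r
set.seed(1)                          # assumed seed, for reproducibility only
n <- 100; p <- 20                    # far smaller than the 1000 x 1000 case
x <- matrix(rnorm(n * p), n, p)      # assumed: independent Gaussian predictors
y <- rnorm(n)                        # assumed: response unrelated to x

# Same helper as in the post: overwrite the template's fit components
# with the lm.fit() result for one column.
looselm <- function(y, xi, tmpl) {
    z <- lm.fit(cbind(`(Intercept)` = 1, xi = xi), y)
    tmpl[names(z)] <- z
    tmpl
}

# Slope p-values from a full lm() call per column
pvals_full <- vapply(seq_len(p), function(i)
    summary(lm(y ~ x[, i]))$coefficients[2, 4], numeric(1))

# Slope p-values from lm.fit() results placed into one template fit
tmpl <- lm(y ~ x[, 1])
pvals_tmpl <- vapply(seq_len(p), function(i)
    summary(looselm(y, x[, i], tmpl))$coefficients[2, 4], numeric(1))

# Compare the two sets of p-values up to numerical tolerance
all.equal(pvals_full, pvals_tmpl)
```

vapply() returns a numeric vector directly; the lapply() and mclapply() 
versions above return lists, which unlist() turns into the vector the 
original loop produced.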

Presumably the full summary() machinery is also not required. Likely 
there are significant additional short-cuts.
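
As a rough sketch of what such short-cuts could look like (not tested at 
the scale above), the slope p-value can be computed straight from the 
lm.fit() result, and for a one-predictor model with intercept it can also 
be derived from the correlation of each column with y in a single 
vectorized step. The function names pval_lmfit and pval_cor are made up 
for illustration, and both assume complete data and non-constant, 
full-rank columns.

```r
# p-value for the slope from lm.fit() output, without building a
# summary.lm object. Assumes a full-rank fit (no pivoting).
pval_lmfit <- function(y, xi) {
    z <- lm.fit(cbind(`(Intercept)` = 1, xi = xi), y)
    rdf <- z$df.residual
    sigma2 <- sum(z$residuals^2) / rdf             # residual variance
    p <- z$rank
    # (X'X)^{-1} recovered from the R factor of the QR decomposition
    xtx_inv <- chol2inv(z$qr$qr[seq_len(p), seq_len(p), drop = FALSE])
    se <- sqrt(sigma2 * diag(xtx_inv))             # coefficient std. errors
    tval <- z$coefficients[[2]] / se[2]            # t statistic of the slope
    2 * pt(abs(tval), rdf, lower.tail = FALSE)
}

# All columns at once: in simple linear regression the slope's t statistic
# equals r * sqrt((n - 2) / (1 - r^2)), with r the sample correlation.
pval_cor <- function(x, y) {
    n <- length(y)
    r <- drop(cor(x, y))                           # one correlation per column
    tval <- r * sqrt((n - 2) / (1 - r^2))
    2 * pt(abs(tval), df = n - 2, lower.tail = FALSE)
}

# Small simulated check against lm(); data are assumed, not from the post
set.seed(2)
xs <- matrix(rnorm(50 * 5), 50, 5)
ys <- rnorm(50)
ref <- vapply(seq_len(ncol(xs)), function(i)
    summary(lm(ys ~ xs[, i]))$coefficients[2, 4], numeric(1))
all.equal(ref, pval_cor(xs, ys))
all.equal(ref[1], pval_lmfit(ys, xs[, 1]))
```

The correlation form avoids the per-column QR decomposition entirely, so 
it only applies to the single-predictor model in the question; 
pval_lmfit() extends more naturally to models with additional covariates.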

Martin



-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


