[R-sig-hpc] Matrix multiplication
claudia.beleites at ipht-jena.de
Wed Mar 14 15:47:32 CET 2012
> On other machines, I might use a
> multithreaded BLAS like gotoblas so that I have some flexibility (though
> apparently unlike Claudia, I rarely change it in practice).
:-) Yes, I do change it in practice, because I have steps where I use
explicit parallelization via multicore or snow, and I switch between the
three different parallel computation types. Our server has 2 hex-core CPUs
but only 8 GB RAM. The spectroscopic data analysis I do usually isn't
computationally hard, but the data sets are often uncomfortably
large for the server. With explicit parallelization, RAM often restricts
me to 2 or 3 workers.
Here's what I observe and why I switch back and forth:
If the calculation is implicitly parallel with the optimized BLAS,
that's the way to go: easiest on RAM, fast, and no coding effort whatsoever.
Just lean back and enjoy watching all cores hard at work.
Some functions, e.g. %*% and (t)crossprod, use all 12 cores (or
however many I restrict GOTO_NUM_THREADS to).
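A minimal sketch of the implicit case (assuming R is linked against GotoBLAS; the matrix size is arbitrary):

```r
## Implicit BLAS parallelism: nothing to code on the R side.
## GOTO_NUM_THREADS must be in the environment before the BLAS
## initializes, e.g. when launching R from the shell:
##   GOTO_NUM_THREADS=12 R
X <- matrix(rnorm(2000 * 2000), 2000)

system.time(XtX <- crossprod(X))  # all BLAS threads busy
system.time(Y   <- X %*% XtX)     # likewise
```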
Other functions, e.g. loess(), never seem to use more than the 6 cores
of one CPU. For these, I'm better off with explicit parallelization:
2 snow nodes with GOTO_NUM_THREADS = 6 (I have to run taskset on each
node to pin it to one CPU). However, snow (and multicore) need more RAM,
as the data must be loaded into each node. In practice that means e.g.
GOTO_NUM_THREADS = 11 (leaving an "alibi core" for my colleague) in the
main R session, and either 2 nodes with GOTO_NUM_THREADS = 6 or 3 nodes
with GOTO_NUM_THREADS = 4.
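A sketch of the 2-node snow setup, with hypothetical names (`chunks` stands for the split data set, and the loess call is a toy fit); the CPU pinning via taskset happens outside R and isn't shown:

```r
library(snow)

cl <- makeCluster(2, type = "SOCK")

## ask each node's GotoBLAS for 6 threads; depending on the BLAS build
## this may need to be set before the node's BLAS first initializes
clusterEvalQ(cl, Sys.setenv(GOTO_NUM_THREADS = "6"))

## the source of snow's higher RAM use: the data is copied to every node
clusterExport(cl, "chunks")   # 'chunks': hypothetical list of data frames
fits <- clusterApply(cl, seq_along(chunks),
                     function(i) loess(y ~ x, data = chunks[[i]]))

stopCluster(cl)
```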
Multicore doesn't make use of the implicit parallelization of the BLAS,
but it is easier to use than snow: no cluster setup required, no hassle
with exporting all the variables, etc.
So, if the function doesn't have any implicit parallelization anyway, I
just change lapply to mclapply, and that's it.
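For illustration, with a hypothetical worker function `analyse` and a list of inputs `spectra`:

```r
library(parallel)  # mclapply, originally from the multicore package

## serial version
res <- lapply(spectra, analyse)

## explicitly parallel: the only change is the function name;
## RAM usually limits the number of forked workers to 2 or 3 here
res <- mclapply(spectra, analyse, mc.cores = 3)
```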