[R-sig-hpc] Matrix multiplication

Simon Urbanek simon.urbanek at r-project.org
Thu Mar 15 03:48:31 CET 2012


Claudia,

On Mar 14, 2012, at 10:47 AM, Claudia Beleites wrote:

>> On other machines, I might use a
>> multithreaded BLAS like gotoblas so that I have some flexibility (though
>> apparently unlike Claudia, I rarely change it in practice).
> 
> :-) Yes, I do change it in practice, because I have steps where I use
> explicit parallelization via multicore or snow, and I switch among the
> three different parallel computation types. Our server has 2 hex-core
> CPUs but only 8 GB RAM. The spectroscopic data analysis I do usually
> isn't really hard computationally, but the data sets are often
> uncomfortably large for the server. With explicit parallelization, RAM
> often restricts me to 2 or 3 threads.
> 
> Here's what I observe and why I switch back and forth:
> 
> If the calculation is implicitly parallel with the optimized BLAS,
> that's the way to go: easiest on RAM, fast, no coding effort
> whatsoever. Just lean back and enjoy seeing all cores hard at work.
> Functions like %*% and (t)crossprod use all 12 cores (or whatever I
> restrict GOTO_NUM_THREADS to).
> 
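
For illustration, a quick sketch of watching the threaded BLAS at work
(the matrix size here is arbitrary; with a threaded BLAS both calls fan
out over all available cores):

m <- matrix(rnorm(2000 * 2000), 2000)
system.time(m %*% m)        # plain matrix product, handled by the BLAS
system.time(tcrossprod(m))  # m %*% t(m) in one BLAS call
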
> Other functions, e.g. loess(), never seem to use more than the 6 cores
> of one CPU. For these, I'm better off with explicit parallelization:
> 2 snow nodes with GOTO_NUM_THREADS = 6 (I have to run taskset on each
> node). However, snow (and multicore) need more RAM

Snow does, but multicore does not: the benefit of multicore is that all data present at the point of parallelization is shared, so it uses no extra memory (at least on modern OSes that support copy-on-write fork). The only extra RAM is whatever gets allocated later by the computation that runs in parallel.
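
A minimal sketch of that sharing (object size and core count are
arbitrary): the forked children read big without any export step, and
no copy is made as long as they only read it.

library(parallel)

big <- rnorm(1e7)  # ~80 MB, allocated once in the parent process

# The children see big via copy-on-write fork; only their own results
# allocate new memory.
res <- mclapply(1:4,
                function(i) mean(big[seq(i, length(big), by = 4)]),
                mc.cores = 4)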


> as the data must be
> loaded into each node. That would mean e.g. GOTO_NUM_THREADS = 11 (to
> leave an "alibi core" for my colleague) in the main R session, and e.g.
> 2 nodes with GOTO_NUM_THREADS = 6 or 3 nodes with GOTO_NUM_THREADS = 4.
> 
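
One way to wire that up (a sketch; it assumes the workers' GotoBLAS
reads GOTO_NUM_THREADS from the environment the node processes inherit
at startup):

library(snow)

# Set the BLAS thread count before launching the nodes so that each
# worker R process inherits it.
Sys.setenv(GOTO_NUM_THREADS = "6")
cl <- makeCluster(2, type = "SOCK")
clusterEvalQ(cl, Sys.getenv("GOTO_NUM_THREADS"))  # should report "6" on each node
stopCluster(cl)
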
> Multicore doesn't make use of the implicit parallelization of the BLAS.

Actually, it does:

> library(parallel)  # provides mclapply; m is a large matrix defined earlier
> system.time(mclapply(1:4, function(i) sum(tcrossprod(m^i))))
   user  system elapsed 
 10.136   0.568   0.664 

Note that the user time (10.1 s) far exceeds the elapsed time (0.66 s): each forked process ran the threaded BLAS, so all cores were busy at once. However, you really want to control the interplay of the explicit and implicit parallelization. This is where the parallel package comes into play (and why it includes multicore): for the explicit + R-implicit parallelization (not BLAS, though) it lets us control the maximal load (and the RNG).
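
For example, a sketch of both knobs (the workload is a placeholder):

library(parallel)

RNGkind("L'Ecuyer-CMRG")  # RNG with independent streams per worker
set.seed(42)

# Cap the explicit parallelism at 2 forked processes; each child gets
# its own reproducible RNG stream.
res <- mclapply(1:8, function(i) mean(rnorm(1e5)), mc.cores = 2)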

Cheers,
Simon


> But it is easier to use than snow: no cluster setup required, no
> hassle with exporting all the variables, etc. So, if the function
> doesn't have any implicit parallelization anyway, I just change lapply
> to mclapply, and that's it.
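
The swap really is that small (specs and fit_one are hypothetical
placeholders for a list of inputs and a worker function):

library(parallel)

results <- lapply(specs, fit_one)                  # serial
results <- mclapply(specs, fit_one, mc.cores = 3)  # parallel drop-in
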
> 
> Best,
> 
> Claudia
> 


