[R-sig-hpc] Matrix multiplication

Simon Urbanek simon.urbanek at r-project.org
Wed Mar 14 18:24:57 CET 2012


On Mar 14, 2012, at 12:53 PM, Paul Gilbert wrote:

> 
> 
> On 12-03-13 09:59 PM, Simon Urbanek wrote:
>> 
>> On Mar 13, 2012, at 3:05 PM, Paul Gilbert wrote:
>> 
>>> 
>>> 
>>> On 12-03-13 12:50 PM, Brian G. Peterson wrote:
>>>> On Tue, 2012-03-13 at 12:40 -0400, Paul Gilbert wrote:
>>>>> Brian
>>>>> 
>>>>> Thanks for spelling this out for those of us that are a bit slow.
>>>>> (Newbie questions below)
>>>> 
>>>> <... snip ...>
>>>> 
>>>>>> So, if your BLAS does multithreaded matrix multiplication, it will use
>>>>>> multiple threads 'implicitly', as Simon pointed out.
>>>>> 
>>>>> Is there an easy way to know if the R I am using has been compiled with
>>>>> multi-thread BLAS support?
>>>> 
>>>> BLAS should be 'plug and play', as R is usually compiled to use a shared
>>>> object BLAS.  As such, installing the BLAS on your machine (and
>>>> appropriately configuring it) should 'just work' with the new BLAS when
>>>> you restart R.
>>>> 
>>>> Dirk et al. wrote a paper, now a bit dated, that benchmarked some of
>>>> the BLAS libraries; it should have some additional details.
>>> 
>>> (I have a long history of getting things that should 'just work' to 'just not work'.) But I didn't really state my question very well. I'm really wondering about two related situations. First, after a change to the underlying system, how can I confirm that R is actually using the new configuration? And second, if I am running benchmarks in R, is there an easy way to record the underlying configuration being used?
>>> 
>> 
>> You can check whether you're leveraging multiple cores simply via system.time:
>> 
>>> m=matrix(rnorm(4e6),2000)
>>> system.time(m %*% m)
>>    user  system elapsed
>>   6.860   0.020   0.584
>> 
>> The above is clearly using a threaded BLAS (here I'm using ATLAS), because
>> the elapsed time is much smaller than the CPU time, so the work was computed in parallel.
> 
> Perhaps I am misreading something. I don't see elapsed < CPU,

0.584 (elapsed) < 6.86 (user CPU time), so elapsed is indeed well below CPU.
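If it helps, here is a minimal sketch that makes the comparison explicit (just a rough illustration; the ratio is only a crude indicator of how many cores were kept busy):

m <- matrix(rnorm(4e6), 2000)
tm <- system.time(m %*% m)
## with a threaded BLAS the elapsed time is much smaller than the user time
tm[["user.self"]] / tm[["elapsed"]]   ## roughly how many cores were busy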


> so it does not seem quite as obvious as you suggest, but I certainly see the difference with the single-threaded case below.
> 
>> In contrast this is what you get using single-threaded R BLAS on the same machine:
>> 
>>> system.time(m %*% m)
>>    user  system elapsed
>>  10.480   0.020  10.505
>> 
>> It takes about 18x longer - a combination of the number of cores and the less optimized BLAS - and the elapsed time is greater than or equal to the CPU time, which indicates single-threaded execution.
>> 
>> As for recording the underlying configuration - that is not really possible in general - you have to know what you enabled/compiled. In the case of a shared BLAS implementation
>> you may be able to infer that from the library name, but for a static BLAS it is close to impossible to figure it out.
> 
> I was afraid this would be the case. It is often hard to keep track even when I'm compiling R myself, and I guess if you don't compile yourself there is not much hope of knowing what you really have.
> (Food for thought when considering timing comparisons.)
> 

It is separate from R (at least as long as you have shared BLAS enabled, which is the default for most distributions) -- so it's really about which BLAS library you point R to.
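On Linux with a shared-BLAS build you can sometimes get a hint from within R. A rough sketch (the libRblas.so path and the ldd call are platform-specific assumptions, and this tells you nothing for a static BLAS):

rblas <- file.path(R.home("lib"), "libRblas.so")
if (file.exists(rblas)) {
  ## if libRblas.so is a symlink (e.g. to ATLAS or GotoBLAS), the target
  ## name may hint at the implementation actually in use
  print(Sys.readlink(rblas))
  ## ldd shows which shared libraries the BLAS stub links against
  system(paste("ldd", shQuote(rblas)))
}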

But, yes, timing comparisons are pretty meaningless unless you specify everything you have (this is how some people can post benchmarks against strawman installations and claim to be faster even though there is in fact no difference).
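At a minimum it is worth recording what R itself can report alongside the benchmark, even though that does not identify a statically linked BLAS. A sketch along those lines:

info <- list(
  r       = R.version.string,
  machine = Sys.info()[c("sysname", "release", "machine")],
  session = sessionInfo()
)
str(info, max.level = 1)   ## keep this alongside the benchmark results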

Cheers,
Simon


> Thanks,
> Paul
> 
>> Cheers,
>> Simon
>> 
>> 
>> 
>>> Thanks again,
>>> Paul
>>>> 
>>>> <...snip...>
>>>> 
>>>>>> Be aware that there can be unintended negative interactions between
>>>>>> implicit and explicit parallelization.  On cluster nodes I tend to
>>>>>> configure the BLAS to use only one thread to avoid resource contention
>>>>>> when all cores are doing explicit parallelization.
>>>>> 
>>>>> How do you do this? Does it need to be done when you are compiling R, or
>>>>> can it be done on the fly while running R processes?
>>>> 
>>>> Some BLAS, like GotoBLAS, support an environment variable to change the
>>>> number of cores to be used.  This can be changed at run-time.  Others,
>>>> like MKL, are always multithreaded.  Others, like ATLAS, can be
>>>> compiled in either single-threaded or multi-threaded mode.
>>>> 
>>>> So, for me, on my cluster nodes, I use a single-threaded BLAS, assuming
>>>> that *explicit* parallelization will be the primary driver of CPU load,
>>>> and not wanting to over-commit the processor when 12 calculations each
>>>> try to spawn 12 threads in the BLAS.  On other machines, I might use a
>>>> multithreaded BLAS like GotoBLAS so that I have some flexibility (though
>>>> apparently unlike Claudia, I rarely change it in practice).
>>>> 
>>>> Regards,
>>>> 
>>>>    - Brian
>>>> 
>>> 
>>> _______________________________________________
>>> R-sig-hpc mailing list
>>> R-sig-hpc at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>> 
>>> 
>> 
> 
> 

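P.S. On Brian's point about limiting BLAS threads on cluster nodes: a minimal sketch, assuming a BLAS that honours a thread-count environment variable (GOTO_NUM_THREADS for GotoBLAS, OMP_NUM_THREADS for many OpenMP-based builds). Whether a change takes effect after the library has already been loaded depends on the BLAS, so setting the variable before starting R is the safe route.

Sys.setenv(GOTO_NUM_THREADS = "1")   ## GotoBLAS
Sys.setenv(OMP_NUM_THREADS  = "1")   ## many OpenMP-based BLAS builds
m <- matrix(rnorm(4e6), 2000)
system.time(m %*% m)   ## elapsed close to user time suggests a single thread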

