[R-SIG-Mac] BLAS performance woes on 64-bit Mac OS X

Thu Sep 9 00:51:13 CEST 2010

Hi everyone!

I just saw Jan de Leeuw's e-mail about his BLAS benchmarks, which is an interesting coincidence -- I've been wanting to ask the list about some inexplicable (and somewhat disappointing) matrix benchmark results on 64-bit Mac OS X with vecLib BLAS.

My problem seems to be different from Jan's, though, as the CPU load on my computer indicates that both cores are used; so it can't be an issue with using multiple threads.  That's why I'm starting a new thread.

For a little bit of background: I'm looking for the fastest method to calculate inner products between large sets of vectors and have been benchmarking various algorithms for this purpose, most of them performing a matrix multiplication M %*% t(M) in different ways.  Tests were run on an early 2008 MacBook Pro with Intel Core 2 Duo, 2.5 GHz and 6 GB RAM, using R 2.11.1 on Leopard (10.5.8) and Snow Leopard (10.6.4); I also tried today's R-devel with the same results.
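For readers who want to see the variants side by side, here is a minimal sketch of three equivalent ways to compute all inner products between the rows of a matrix.  The random matrix, its size, the helper name mops() and the operation count n^2 * k are assumptions made for illustration; they are not taken from the benchmark script.

```r
# Small random stand-in for the real data (rows = vectors); sizes are arbitrary.
set.seed(42)
M <- matrix(rnorm(500 * 200), nrow = 500, ncol = 200)

# Three ways to get all pairwise inner products between the rows of M.
P1 <- M %*% t(M)
P2 <- tcrossprod(M)
P3 <- crossprod(t(M))

# Hypothetical throughput estimate: a full n x n result with inner
# dimension k takes n^2 * k multiply-accumulate operations.
mops <- function(n, k, seconds) n^2 * k / seconds / 1e6

elapsed <- system.time(tcrossprod(M))[["elapsed"]]
if (elapsed > 0) {
  cat("tcrossprod:", round(mops(nrow(M), ncol(M), elapsed)), "MOPS\n")
}
```

On a matrix this small the elapsed time may be too short to measure reliably; a larger matrix or repeated runs give more stable figures.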

To my big surprise, matrix operations are _much_ slower in 64-bit R than in 32-bit R (controlled with the --arch option).  This was completely unexpected, as 64-bit code is usually a little faster than the equivalent 32-bit code (5%-10% in my experience).  Here are some benchmark results (MOPS is an estimate of millions of multiply-accumulate operations per second):

> ------------------------------------------------------------------------
> MacBook Pro 4,1 (2008), Intel Core 2 Duo T9300 2.5 GHz, 6MB L2 Cache, 800 MHz FSB, GeForce 8600M GT
> Mac OS X 10.5.8, R 2.11.1, 32-bit, vecLib BLAS (default)
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      2.213  5963.45   3.977
> inner tcrossprod D      1.216 10852.89   2.033
> inner crossprod t(M) D  1.230 10729.36   2.038
> 
> ------------------------------------------------------------------------
> MacBook Pro 4,1 (2008), Intel Core 2 Duo T9300 2.5 GHz, 6MB L2 Cache, 800 MHz FSB, GeForce 8600M GT
> Mac OS X 10.5.8, R 2.11.1, 64-bit, vecLib BLAS (default)
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      3.087  4275.06   5.149
> inner tcrossprod D      2.743  4811.20   3.542
> inner crossprod t(M) D  2.012  6559.20   3.072

As you can see, the 64-bit code is much slower than the 32-bit code, especially for the (t)crossprod operations, and my beautiful (and expensive :) MacBook is even outperformed by a high-end netbook running Linux:

> ------------------------------------------------------------------------
> Acer 1810TX, Intel Pentium Dual Core U4100 1.3 GHz, 2MB L2 Cache, 800 MHz FSB, Intel GMA X4500
> Ubuntu Linux 10.04LTS "Lucid Lynx" 64-bit, R 2.11.1, 64-bit, reference BLAS
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      1.674  7883.58    1.66
> inner tcrossprod D      0.966 13661.61    0.97
> inner crossprod t(M) D 18.672   706.79   18.65

After some experimentation, it seems that the culprit is Apple's vecLib.  If I switch to the reference BLAS shipped with R, I get the expected slight advantage for the 64-bit code, and overall performance increases considerably:

> ------------------------------------------------------------------------
> MacBook Pro 4,1 (2008), Intel Core 2 Duo T9300 2.5 GHz, 6MB L2 Cache, 800 MHz FSB, GeForce 8600M GT
> Mac OS X 10.5.8, R 2.11.1, 32-bit, reference BLAS
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      1.253 10532.41   1.213
> inner tcrossprod D      0.721 18303.90   0.692
> inner crossprod t(M) D 12.845  1027.41  12.651

> ------------------------------------------------------------------------
> MacBook Pro 4,1 (2008), Intel Core 2 Duo T9300 2.5 GHz, 6MB L2 Cache, 800 MHz FSB, GeForce 8600M GT
> Mac OS X 10.5.8, R 2.11.1, 64-bit, reference BLAS
> 
>                          Time     MOPS CPUTime
> inner M %*% t(M) D      1.216 10852.89   1.187
> inner tcrossprod D      0.683 19322.27   0.658
> inner crossprod t(M) D 13.018  1013.76  12.525
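To check which BLAS an installation is actually using before comparing numbers, something like the following sketch may help.  It assumes the standard CRAN framework layout on Mac OS X, where lib/libRblas.dylib is a symlink to either libRblas.vecLib.dylib (vecLib) or libRblas.0.dylib (reference BLAS), as described in the R for Mac OS X FAQ; the helper name which_rblas() is made up for this example.

```r
# Assumption: standard CRAN framework layout, where lib/libRblas.dylib is a
# symlink to either libRblas.vecLib.dylib (vecLib) or libRblas.0.dylib
# (reference BLAS).  The helper name is hypothetical.
which_rblas <- function(lib_dir = file.path(R.home(), "lib")) {
  link <- file.path(lib_dir, "libRblas.dylib")
  target <- Sys.readlink(link)
  if (is.na(target) || !nzchar(target)) {
    return(NA_character_)  # no such symlink (other platform or custom build)
  }
  basename(target)
}

which_rblas()
```

On other platforms or custom builds the function simply returns NA, so it only says something about installations that follow this layout.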

I thought this might be a fluke of my particular hardware and software setup, but Jan's results lead me to believe that this may be a general problem with vecLib.  Has anybody else on the list observed similar behaviour?  If so, would it make sense to change the default to the reference BLAS?  In my benchmarks, it was consistently faster than vecLib (except for crossprod), but there may be other operations and situations in which vecLib performs better.

Some other remarks and observations:

 - The reference BLAS performs very poorly on crossprod() as opposed to tcrossprod(), while they're equally fast with vecLib.  If one is aware of this, it's relatively easy to work around in most situations, though (as t() is relatively cheap).

 - I've also tried the standard Ubuntu ATLAS instead of the reference BLAS, which performed very poorly at around 2000 MOPS.  Optimising BLAS libraries seems to be a tricky business ...

 - My vectors are very sparse (part of the task I'm benchmarking for).  This may have an influence on the result (if there are special optimisations for 0 entries in the BLAS libraries), but I doubt this is the case.

 - I did some benchmarks for Euclidean distances between the vectors as well, finding that dist() is an extremely slow operation -- I had been aware of this, just not how bad the situation really is.  dist() runs at about 160 MOPS, while a (numerically unstable) approximation with matrix operations is almost 8x faster.
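Two of the remarks above can be sketched in a few lines of R: the crossprod() workaround via t() and tcrossprod(), and a matrix-based Euclidean distance.  The distance formula used here, |x - y|^2 = |x|^2 + |y|^2 - 2 <x, y>, is an assumption about what such an approximation could look like, not necessarily the one used in the benchmarks; the function name dist_matrix() and the random test matrix are likewise illustrative.

```r
set.seed(1)
M <- matrix(rnorm(300 * 50), nrow = 300, ncol = 50)

# Workaround for a slow crossprod(): crossprod(A) equals tcrossprod(t(A)).
A <- t(M)
C_direct <- crossprod(A)
C_workaround <- tcrossprod(t(A))

# Euclidean distances between rows via |x - y|^2 = |x|^2 + |y|^2 - 2 <x, y>.
# Cancellation makes this inaccurate for nearly identical vectors, so small
# negative values are clamped to zero before taking the square root.
dist_matrix <- function(M) {
  sq_norms <- rowSums(M^2)
  d2 <- outer(sq_norms, sq_norms, "+") - 2 * tcrossprod(M)
  sqrt(pmax(d2, 0))
}

D_fast <- dist_matrix(M)
D_ref <- as.matrix(dist(M))
max(abs(D_fast - D_ref))  # size of the approximation error
```

The last line gives a rough idea of how far the matrix-based result drifts from dist() on a given data set; the error tends to be largest for pairs of nearly identical vectors.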

If you want to try for yourself, you can check out the benchmark code and the sample data set I used from R-Forge:

	svn checkout svn://scm.r-forge.r-project.org/svnroot/wordspace/illustrations benchmark 

Then run the script "matrix_benchmarks.R" in the new directory benchmark/.

I'd be interested to hear about substantially different results on other Mac computers / R versions.  Has anybody got a highly optimised BLAS on the Mac?

Best wishes,
Stefan