[R] Speed up sum of outer products?
Stefan Evert
stefanML at collocations.de
Tue Mar 15 12:10:12 CET 2011
Hi Ajay,
thanks for this comparison, which prodded me to give CUDA another try on my now somewhat aging MacBook Pro.
> Hi Dennis, sorry for the delayed reply, and thanks for the article. I dug
> into it and found that if you have a GPU, the CUBLAS library beats the
> BLAS/ATLAS implementation in the Matrix package for 'large' problems.
I guess you have a very fast CPU (a Core i7, perhaps?), a rather poor BLAS implementation, and a desktop graphics card?
>    user  system elapsed   -- for loop, single thread
>  27.210   6.680  33.342
>    user  system elapsed   -- BLAS mat mult
>   6.260   0.000   5.982
>    user  system elapsed   -- BLAS crossprod
>   4.340   0.000   4.284
>    user  system elapsed   -- CUDA gpuCrossprod
>    1.49    0.00    1.48
Just to put these numbers in perspective, here are my results for a MacBook Pro running Mac OS X 10.6.6 (Core 2 Duo, 2.5 GHz, 6 GB DDR2 RAM, Nvidia GeForce 8600M GT with 512 MB RAM -- I suppose it's the "M" that breaks my performance here).
>    user  system elapsed   -- for loop, single thread
> 141.034  35.299 153.783
>    user  system elapsed   -- BLAS mat mult
>   2.791   0.025   1.805
>    user  system elapsed   -- BLAS crossprod
>   1.419   0.039   0.863
>    user  system elapsed   -- CUDA gpuCrossprod
>   1.431   0.119   1.718
As you can see, my CPU/RAM combination is about 5x slower than your machine; CUDA is slightly slower too (my card has only 32 cores, and presumably lower memory bandwidth and/or clock rate than a desktop card); and vecLib BLAS beats CUDA by a factor of 2.
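For reference, all four timings compute the same thing: the sum of outer products of the rows x_i of a matrix X equals t(X) %*% X, which is exactly what crossprod(X) (and gpuCrossprod from gputools) evaluates with a single BLAS/CUBLAS call. A minimal sketch of the equivalence (the matrix dimensions here are made up for illustration, and the GPU line is commented out since it needs a CUDA-capable card):

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)

## naive version: accumulate one outer product per row of X
S.loop <- matrix(0, ncol(X), ncol(X))
for (i in seq_len(nrow(X))) {
  S.loop <- S.loop + tcrossprod(X[i, ])   # x_i %o% x_i
}

## BLAS-backed equivalents
S.mm <- t(X) %*% X       # explicit matrix multiplication
S.cp <- crossprod(X)     # avoids forming t(X); usually the fastest CPU variant

## GPU analogue from the gputools package (requires a CUDA-capable GPU):
# S.gpu <- gputools::gpuCrossprod(X, X)

all.equal(S.loop, S.mm)  # TRUE (up to numerical tolerance)
all.equal(S.loop, S.cp)  # TRUE (up to numerical tolerance)
```

So the for loop, the matrix product, crossprod, and gpuCrossprod only differ in how the same reduction is scheduled, which is why the benchmark is a fairly clean CPU-vs-GPU comparison.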
Kudos to the gputools developers: despite what the README says, the package compiles out of the box on Mac OS X 10.6, 64-bit R 2.12.1, with CUDA release 3.2. Thanks for this convenient package!
Best regards,
Stefan Evert
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]