[R] Speed up sum of outer products?
Stefan Evert
stefanML at collocations.de
Tue Mar 15 12:10:12 CET 2011
Hi Ajay,
thanks for this comparison, which prodded me to give CUDA another try on my now somewhat aging MacBook Pro.
> Hi Dennis, sorry for the delayed reply, and thanks for the article. I dug
> into it and found that if you have a GPU, the CUBLAS library beats the
> BLAS/ATLAS implementation in the Matrix package for 'large' problems.
I guess you have a very fast CPU (a Core i7, perhaps?), a rather poor BLAS implementation, and a desktop graphics card?
>    user  system elapsed   -- for loop, single thread
>  27.210   6.680  33.342
>    user  system elapsed   -- BLAS mat mult
>   6.260   0.000   5.982
>    user  system elapsed   -- BLAS crossprod
>   4.340   0.000   4.284
>    user  system elapsed   -- CUDA gpuCrossprod
>    1.49    0.00    1.48
Just to put these numbers in perspective, here are my results for a MacBook Pro running Mac OS X 10.6.6 (Core 2 Duo, 2.5 GHz, 6 GB DDR2 RAM, Nvidia GeForce 8600M GT with 512 MB RAM -- I suppose it's the "M" that breaks my performance here).
>    user  system elapsed   -- for loop, single thread
> 141.034  35.299 153.783
>    user  system elapsed   -- BLAS mat mult
>   2.791   0.025   1.805
>    user  system elapsed   -- BLAS crossprod
>   1.419   0.039   0.863
>    user  system elapsed   -- CUDA gpuCrossprod
>   1.431   0.119   1.718
As you can see, my CPU/RAM combination is about 5x slower than your machine; CUDA is slightly slower too (my card has only 32 cores, and presumably lower memory bandwidth and/or clock rate than a desktop card); and vecLib BLAS beats CUDA by a factor of 2.
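For reference, all four timings compute the same thing: the sum of outer products of the rows x_i of a matrix X equals t(X) %*% X, which is exactly what crossprod(X) (and gpuCrossprod from gputools) evaluates with a single BLAS/CUBLAS call. A minimal sketch of the equivalence (the matrix dimensions here are made up for illustration, and the GPU line is commented out since it needs a CUDA-capable card):

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)

## naive version: accumulate one outer product per row of X
S.loop <- matrix(0, ncol(X), ncol(X))
for (i in seq_len(nrow(X))) {
  S.loop <- S.loop + tcrossprod(X[i, ])   # x_i %o% x_i
}

## BLAS-backed equivalents
S.mm <- t(X) %*% X       # explicit matrix multiplication
S.cp <- crossprod(X)     # avoids forming t(X); usually the fastest CPU variant

## GPU analogue from the gputools package (requires a CUDA-capable GPU):
# S.gpu <- gputools::gpuCrossprod(X, X)

all.equal(S.loop, S.mm)  # TRUE (up to numerical tolerance)
all.equal(S.loop, S.cp)  # TRUE (up to numerical tolerance)
```

So the for loop, the matrix product, crossprod, and gpuCrossprod only differ in how the same reduction is scheduled, which is why the benchmark is a fairly clean CPU-vs-GPU comparison.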
Kudos to the gputools developers: despite what the README says, the package compiles out of the box on Mac OS X 10.6, 64-bit R 2.12.1, with CUDA release 3.2. Thanks for this convenient package!
Best regards,
Stefan Evert
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]