[Rd] Tesla GPUs

Fri Aug 5 20:36:47 CEST 2011

On Jul 19, 2011, at 12:56 PM, Simon Urbanek wrote:

> 
> On Jul 19, 2011, at 2:26 AM, Prof Brian Ripley wrote:
> 
>> On Mon, 18 Jul 2011, Alireza Mahani wrote:
>> 
>>> Simon,
>>> 
>>> Thank you for elaborating on the limitations of R in handling float types. I
>>> think I'm pretty much there with you.
>>> 
>>> As for the insufficiency of single-precision math (and hence limitations of
>>> GPU), my personal take so far has been that double-precision becomes crucial
>>> when some sort of error accumulation occurs. For example, in differential
>>> equations where boundary values are integrated to arrive at interior values,
>>> etc. On the other hand, in my personal line of work (Hierarchical Bayesian
>>> models for quantitative marketing), we have so much inherent uncertainty and
>>> noise at so many levels in the problem (and no significant error
>>> accumulation sources) that single vs double precision issue is often
>>> inconsequential for us. So I think it really depends on the field as well as
>>> the nature of the problem.
>> 
>> The main reason to use only double precision in R was that on modern CPUs double precision calculations are as fast as single-precision ones, and with 64-bit CPUs they are a single access.  So the extra precision comes more-or-less for free.  You also under-estimate the extent to which stability of commonly used algorithms relies on double precision.  (There are stable single-precision versions, but they are no longer commonly used.  And as Simon said, in some cases stability is ensured by using extra precision where available.)
>> 
>> I disagree slightly with Simon on GPUs: I am told by local experts that the double-precision on the latest GPUs (those from the last year or so) is perfectly usable.  See the performance claims on http://en.wikipedia.org/wiki/Nvidia_Tesla of about 50% of the SP performance in DP.
>> 
> 
> That would be good news. Unfortunately those seem to be still targeted at a specialized market and are not really graphics cards in traditional sense. Although this is sort of required for the purpose it removes the benefit of ubiquity. So, yes, I agree with you that it may be an interesting way forward, but I fear it's too much of a niche to be widely supported. I may want to ask our GPU specialists here to see if they have any around so I could re-visit our OpenCL R benchmarks. Last time we abandoned our OpenCL R plans exactly due to the lack of speed in double precision.
> 

A quick update - it turns out we have a few Tesla/Fermi machines here, so I ran some very quick benchmarks on them. The test case was the same as for the original OpenCL comparisons posted here a while ago when Apple introduced it: dnorm on long vectors:

64M, single:
-- GPU -- total: 4894.1 ms, compute: 234.5 ms, compile: 4565.7 ms, real: 328.3 ms
-- CPU -- total: 2290.8 ms

64M, double:
-- GPU -- total: 5448.4 ms, compute: 634.1 ms, compile: 4636.4 ms, real: 812.0 ms
-- CPU -- total: 2415.8 ms

128M, single:
-- GPU -- total: 5843.7 ms, compute: 469.2 ms, compile: 5040.5 ms, real: 803.1 ms
-- CPU -- total: 4568.9 ms

128M, double:
-- GPU -- total: 6042.8 ms, compute: 1093.9 ms, compile: 4583.3 ms, real: 1459.5 ms
-- CPU -- total: 4946.8 ms

The CPU times are based on a dual Xeon X5690 machine (12 cores @ 3.47GHz) using OpenMP, but are very approximate, because there were two other jobs running on machine -- still, it should be a good ballpark figure. The GPU times are run on Tesla S2050 using OpenCL, addressed as one device so presumably comparable to the performance of one Tesla M2050.
The figures to compare are GPU.real (which is computation + host memory I/O) and CPU.total, because we can assume that we can compile the kernel in advance, but you can't save on the memory transfer (unless you find a good way to chain calls which is not realistic in R).

So the good news is that the new GPUs fulfill their promise : double precision is only twice as slow as single precision. Also they scale approximately linearly - see the real time of 64M double is almost the same as 128M single. They also outperform the CPUs as well, although not by an order of magnitude.

The double precision support is very good news, and even though we are still using GPUs in a suboptimal manner, they are faster than the CPUs. The only practical drawback is that using OpenCL requires serious work, it's not as easy as slapping omp pragmas on existing code. Also the HPC Teslas are quite expensive so I don't expect to see them in desktops anytime soon. However, for people that are thinking about big computation, it may be an interesting way to go. Given that it's not mainstream I don't expect core R to have OCL support just yet, but it may be worth keeping in mind for the future as we are designing the parallelization framework in R.

Cheers,
Simon