[Rd] Peculiar timing result

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Mar 11 12:29:45 CET 2006


Here is a summary of some results on a dual Opteron 252 running FC3:

64-bit gcc 3.4.5
                      user  system elapsed
R's blas             34.83    3.45   38.56
ATLAS                36.70    3.28   40.14
ATLAS multithread    76.85    5.39   82.29
Goto 1 thread        36.17    3.44   39.76
Goto multithread    178.06  345.97  467.99
ACML                 49.69    3.36   53.23

64-bit gcc 4.1.0
R's blas             34.98    3.49   38.55

32-bit gcc 3.4.5
R's blas             33.72    3.27   36.99

32-bit gcc 4.1.0
R's blas             34.62    3.25   37.93

The timings are not very repeatable, but the message seems clear: this
problem does not benefit from a tuned BLAS, and the overhead from
multiple threads is harmful.  (The gcc 4.1.0 runs took fewer
iterations, which skews the results in its favour.)
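
Since single timings like these vary from run to run, one way to gauge
the spread is to repeat the measurement and summarize it.  The following
is only a minimal sketch: time_reps() is a hypothetical helper, and the
cross-product is a small stand-in workload (an assumption), not the lmer
fit timed in this thread.

```r
# Stand-in workload (an assumption): a dense cross-product, not the lmer fit.
set.seed(1)
x <- matrix(rnorm(200 * 50), nrow = 200)

# Hypothetical helper: call f() `reps` times and collect the first three
# entries of system.time(), i.e. user, system and elapsed seconds.
time_reps <- function(f, reps = 5) {
  out <- t(sapply(seq_len(reps),
                  function(i) unclass(system.time(f()))[1:3]))
  colnames(out) <- c("user", "system", "elapsed")
  out
}

timings <- time_reps(function() crossprod(x), reps = 5)
print(timings)

# Spread across runs: median and range of the elapsed time.
elapsed <- timings[, "elapsed"]
print(c(median = median(elapsed), min = min(elapsed), max = max(elapsed)))
```

Replacing the function passed to time_reps() with the real model fit
would give a rough idea of how far apart two BLAS builds need to be
before the difference is more than noise.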

And my 2GHz Pentium M laptop under Windows gave 39.96  3.68 44.06.

Clearly the Goto BLAS has a problem here: the results are slower on a dual 
252 than a dual 248 (see below).
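
To make the comparison easier to read, the 64-bit gcc 3.4.5 figures
above can be put in a data frame and expressed relative to R's own BLAS.
This is a sketch that assumes the three numbers per row are the user,
system and elapsed seconds printed by system.time(); the column names
rel_elapsed and sys_share are made up for illustration.

```r
# Timings copied from the 64-bit gcc 3.4.5 results in this message
# (assumed order: user, system, elapsed, in seconds).
res <- data.frame(
  blas    = c("R's blas", "ATLAS", "ATLAS multithread",
              "Goto 1 thread", "Goto multithread", "ACML"),
  user    = c(34.83, 36.70, 76.85, 36.17, 178.06, 49.69),
  system  = c(3.45, 3.28, 5.39, 3.44, 345.97, 3.36),
  elapsed = c(38.56, 40.14, 82.29, 39.76, 467.99, 53.23),
  stringsAsFactors = FALSE
)

# Elapsed time relative to R's reference BLAS.
base <- res$elapsed[res$blas == "R's blas"]
res$rel_elapsed <- round(res$elapsed / base, 2)

# Fraction of CPU time spent in system calls; a large value is the
# symptom described in the quoted reply for misbehaving threads.
res$sys_share <- round(res$system / (res$user + res$system), 2)

print(res[order(res$elapsed), ])
```

Sorting by elapsed time puts R's BLAS first and the multithreaded Goto
run last, and the system share column singles out that same run.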


On Fri, 3 Mar 2006, Prof Brian Ripley wrote:

> On Fri, 3 Mar 2006, Douglas Bates wrote:
>
>> I have been timing a particular model fit using lmer on several
>> different computers and came up with a peculiar result - the model fit
>> is considerably slower on a dual-core Athlon 64 using Goto's
>> multithreaded BLAS than on a single-core processor.
>
> Is there a Goto BLAS tuned for that chip?  I can only see one tuned for an
> (unspecified) Opteron.  L1 and L2 cache sizes do sometimes matter a lot
> for tuned BLAS, and (according to the AMD site I just looked up) the X2
> 3800+ has only a 512KB L2 cache per core.  Opterons have a 1MB L2 cache.
>
> Also, the very large system time seen in the dual-core run is typical of
> what I see when pthreads is not working right, and I suggest you try a
> limit of one thread (see the R-admin manual).  On our dual-processor
> Opteron 248 that ran in 44 secs instead of 328.
>
>> Here is the timing on a single-core Athlon 64 3000+ running under
>> today's R-devel with version 0.995-5 of the Matrix package.
>>
>> > library(Matrix)
>> > data(star, package = 'mlmRev')
>> > system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
>> +                         control = list(nit=0,grad=0,msV=1)))
>> [1] 43.10  3.78 48.41  0.00  0.00
>>
>>
>> (If you run the timing yourself and don't want to see the iteration
>> output, take the msV=1 out of the control list.  I keep it in there so
>> I can monitor the progress.)
>>
>> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
>> the same version of R, BLAS and Matrix package, the timing ends up
>> with something like
>>
>> 90 140 235 0 0
> ....
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


