[Rd] Peculiar timing result
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sat Mar 11 12:29:45 CET 2006
Here is a summary of some results on a dual Opteron 252 running FC3
(times in seconds: user, system, elapsed):

                          user   system  elapsed
64-bit gcc 3.4.5
  R's blas               34.83     3.45    38.56
  ATLAS                  36.70     3.28    40.14
  ATLAS multithread      76.85     5.39    82.29
  Goto 1 thread          36.17     3.44    39.76
  Goto multithread      178.06   345.97   467.99
  ACML                   49.69     3.36    53.23
64-bit gcc 4.1.0
  R's blas               34.98     3.49    38.55
32-bit gcc 3.4.5
  R's blas               33.72     3.27    36.99
32-bit gcc 4.1.0
  R's blas               34.62     3.25    37.93

The timings are not that repeatable, but the message seems clear: this
problem does not benefit from a tuned BLAS, and the overhead from
multiple threads is harmful. (The gcc 4.1.0 runs took fewer
iterations, which skews the results in its favour.)
And my 2GHz Pentium M laptop under Windows gave 39.96 3.68 44.06.
Clearly the Goto BLAS has a problem here: the multithreaded results are
slower on a dual Opteron 252 than on a dual Opteron 248 (see below).
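
As a rough way to read the figures above, the following R sketch re-enters
the 64-bit gcc 3.4.5 timings and computes, for each BLAS, the share of CPU
time spent in system calls and the elapsed time relative to R's own BLAS.
The names `timings`, `sys_share` and `slowdown` are illustrative choices,
not part of any package.

```r
# Timings (user, system, elapsed seconds) copied from the
# 64-bit gcc 3.4.5 results quoted above.
timings <- rbind(
  "R's blas"          = c(34.83,    3.45,  38.56),
  "ATLAS"             = c(36.70,    3.28,  40.14),
  "ATLAS multithread" = c(76.85,    5.39,  82.29),
  "Goto 1 thread"     = c(36.17,    3.44,  39.76),
  "Goto multithread"  = c(178.06, 345.97, 467.99),
  "ACML"              = c(49.69,    3.36,  53.23)
)
colnames(timings) <- c("user", "system", "elapsed")

# Total CPU time is user + system; a large system share points at
# overhead outside the numerical work itself.
cpu <- timings[, "user"] + timings[, "system"]

summary_tab <- cbind(
  timings,
  sys_share = round(timings[, "system"] / cpu, 2),
  slowdown  = round(timings[, "elapsed"] / timings["R's blas", "elapsed"], 2)
)
print(summary_tab)
```

On these numbers the system share stays below 10% for every configuration
except the multithreaded Goto BLAS, where it is about two thirds of the
CPU time.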
On Fri, 3 Mar 2006, Prof Brian Ripley wrote:
> On Fri, 3 Mar 2006, Douglas Bates wrote:
>
>> I have been timing a particular model fit using lmer on several
>> different computers and came up with a peculiar result - the model fit
>> is considerably slower on a dual-core Athlon 64 using Goto's
>> multithreaded BLAS than on a single-core processor.
>
> Is there a Goto BLAS tuned for that chip? I can only see one tuned for an
> (unspecified) Opteron. L1 and L2 cache sizes do sometimes matter a lot
> for tuned BLAS, and (according to the AMD site I just looked up) the X2
> 3800+ has only a 512 KB L2 cache per core. Opterons have a 1 MB L2 cache.
>
> Also, the very large system time seen in the dual-core run is typical of
> what I see when pthreads is not working right, and I suggest you try a
> limit of one thread (see the R-admin manual). On our dual-processor
> Opteron 248 that ran in 44 secs instead of 328.
>
>> Here is the timing on a single-core Athlon 64 3000+ running under
>> today's R-devel with version 0.995-5 of the Matrix package.
>>
>> > library(Matrix)
>> > data(star, package = 'mlmRev')
>> > system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
>> +                         control = list(nit=0,grad=0,msV=1)))
>> [1] 43.10 3.78 48.41 0.00 0.00
>>
>>
>> (If you run the timing yourself and don't want to see the iteration
>> output, take the msV=1 out of the control list. I keep it in there so
>> I can monitor the progress.)
>>
>> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
>> the same version of R, BLAS and Matrix package, the timing ends up
>> with something like
>>
>> 90 140 235 0 0
> ....
>
>
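
The one-thread suggestion quoted above could be tried along the following
lines. This is only a minimal sketch: it assumes the BLAS in use reads
OMP_NUM_THREADS or GOTO_NUM_THREADS, which depends on the BLAS build (the
R-admin manual is the place to check), and the matrix product is an
arbitrary stand-in for the lmer fit.

```r
# Assumption: the BLAS honours one of these variables. Many threaded
# BLAS builds read them only at start-up, so setting them in the shell
# or in ~/.Renviron before launching R is the safer route; Sys.setenv()
# here mainly documents the intended setting.
Sys.setenv(OMP_NUM_THREADS = "1", GOTO_NUM_THREADS = "1")
print(Sys.getenv(c("OMP_NUM_THREADS", "GOTO_NUM_THREADS")))

# A BLAS-heavy operation to time (hypothetical stand-in for the model fit).
set.seed(1)
x <- matrix(rnorm(500 * 500), 500, 500)
tm <- system.time(crossprod(x))

# A system time far above the user time suggests threading overhead
# rather than useful computation.
print(tm[c("user.self", "sys.self", "elapsed")])
```

Comparing the output with and without the thread limit, in fresh R
sessions, shows whether the large system time goes away.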
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595