[Rd] Peculiar timing result
Douglas Bates
bates at stat.wisc.edu
Tue Mar 14 02:10:06 CET 2006
On 3/11/06, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
> Here is a summary of some results on a dual Opteron 252 running FC3
>
>                        user  system  elapsed
> 64-bit gcc 3.4.5
>   R's blas            34.83    3.45    38.56
>   ATLAS               36.70    3.28    40.14
>   ATLAS multithread   76.85    5.39    82.29
>   Goto 1 thread       36.17    3.44    39.76
>   Goto multithread   178.06  345.97   467.99
>   ACML                49.69    3.36    53.23
>
> 64-bit gcc 4.1.0
>   R's blas            34.98    3.49    38.55
>
> 32-bit gcc 3.4.5
>   R's blas            33.72    3.27    36.99
>
> 32-bit gcc 4.1.0
>   R's blas            34.62    3.25    37.93
>
> The timings are not that repeatable, but the message seems clear that
> this problem does not benefit from a tuned BLAS and the overhead from
> multiple threads is harmful. (The gcc 4.1.0 results took fewer
> iterations, which skews the results in its favour.)
>
> And my 2GHz Pentium M laptop under Windows gave 39.96 3.68 44.06.
>
> Clearly the Goto BLAS has a problem here: the results are slower on a dual
> 252 than a dual 248 (see below).
Thanks for the information on the timings. It happens that this
message ended up in my R-help folder and I only got around to reading
that folder today.
Is it ok with you if I forward this message to Simon Urbanek? I am
having similar difficulties in the timing with R on a dual-core Intel
MacBook.
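
Since the timings quoted above are described as not very repeatable, and a system time far above the user time is the symptom discussed in this thread, a small helper along the following lines may be useful when comparing machines. It is only a sketch: the names time_repeated and sys_dominates, and the ratio threshold, are made up for illustration and are not part of R or of the Matrix package.

```r
# Hypothetical helper: run f() several times and collect the
# user, system and elapsed seconds reported by system.time().
time_repeated <- function(f, reps = 3) {
  runs <- vapply(
    seq_len(reps),
    function(i) as.numeric(system.time(f())[c("user.self", "sys.self", "elapsed")]),
    numeric(3)
  )
  runs <- t(runs)  # one row per repetition
  colnames(runs) <- c("user", "system", "elapsed")
  runs
}

# Hypothetical check: TRUE for rows where system time exceeds
# `ratio` times the user time (the threshold is an arbitrary assumption).
sys_dominates <- function(runs, ratio = 1) {
  runs[, "system"] > ratio * runs[, "user"]
}

# Two rows copied from the 64-bit gcc 3.4.5 figures quoted above.
reported <- rbind(
  "R's blas"         = c(user = 34.83,  system = 3.45,   elapsed = 38.56),
  "Goto multithread" = c(user = 178.06, system = 345.97, elapsed = 467.99)
)
flags <- sys_dominates(reported)
```

With the quoted figures only the multithreaded Goto row is flagged. For a real comparison, pass a function wrapping the lmer call to time_repeated on each machine and look at the spread across repetitions as well as the means.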
>
>
> On Fri, 3 Mar 2006, Prof Brian Ripley wrote:
>
> > On Fri, 3 Mar 2006, Douglas Bates wrote:
> >
> >> I have been timing a particular model fit using lmer on several
> >> different computers and came up with a peculiar result - the model fit
> >> is considerably slower on a dual-core Athlon 64 using Goto's
> >> multithreaded BLAS than on a single-core processor.
> >
> > Is there a Goto BLAS tuned for that chip? I can only see one tuned for an
> > (unspecified) Opteron. L1 and L2 cache sizes do sometimes matter a lot
> > for tuned BLAS, and (according to the AMD site I just looked up) the X2
> > 3800+ only has a 512 KB per-core L2 cache. Opterons have a 1 MB L2 cache.
> >
> > Also, the very large system time seen in the dual-core run is typical of
> > what I see when pthreads is not working right, and I suggest you try a
> > limit of one thread (see the R-admin manual). On our dual-processor
> > Opteron 248 that ran in 44 secs instead of 328.
> >
> >> Here is the timing on a single-core Athlon 64 3000+ running under
> >> today's R-devel with version 0.995-5 of the Matrix package.
> >>
> >>> library(Matrix)
> >>> data(star, package = 'mlmRev')
> >>> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star,
> >>+                         control = list(nit=0,grad=0,msV=1)))
> >> [1] 43.10 3.78 48.41 0.00 0.00
> >>
> >>
> >> (If you run the timing yourself and don't want to see the iteration
> >> output, take the msV=1 out of the control list. I keep it in there so
> >> I can monitor the progress.)
> >>
> >> If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
> >> the same version of R, BLAS and Matrix package, the timing ends up
> >> with something like
> >>
> >> 90 140 235 0 0
> > ....
> >
> >
>
> --
> Brian D. Ripley, ripley at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>
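
The suggestion quoted above, to retry with a limit of one thread, could be tested along these lines. This is a sketch under assumptions: a threaded BLAS normally reads its thread limit from the environment when the library is loaded, so the variable has to be set before R starts, and the variable name GOTO_NUM_THREADS is assumed here rather than taken from this thread (the R-admin manual and the BLAS documentation are the places to confirm it). The helper name run_with_env is made up for illustration.

```r
# Hypothetical helper: evaluate `code` in a fresh R process started with
# the extra environment variables in `env` (a vector of "NAME=value" strings).
run_with_env <- function(code, env) {
  rscript <- file.path(R.home("bin"), "Rscript")
  system2(rscript,
          args = c("--vanilla", "-e", shQuote(code)),
          env = env,
          stdout = TRUE)
}

# Assumption: GOTO_NUM_THREADS is the variable the BLAS in use honours.
# Here the child process only echoes the value it was started with.
out <- run_with_env("cat(Sys.getenv('GOTO_NUM_THREADS'))",
                    env = "GOTO_NUM_THREADS=1")
```

Replacing the echoed expression with the system.time call for the model fit would give a one-thread timing to set against the multithreaded one.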