[Rd] Peculiar timing result
Douglas Bates
bates at stat.wisc.edu
Fri Mar 3 16:45:11 CET 2006
I have been timing a particular model fit using lmer on several
different computers and came up with a peculiar result - the model fit
is considerably slower on a dual-core Athlon 64 using Goto's
multithreaded BLAS than on a single-core processor.
Here is the timing on a single-core Athlon 64 3000+ running under
today's R-devel with version 0.995-5 of the Matrix package.
> library(Matrix)
> data(star, package = 'mlmRev')
> system.time(fm1 <- lmer(math~gr+sx+eth+cltype+(yrs|id)+(1|tch)+(yrs|sch), star, control = list(nit=0,grad=0,msV=1)))
0 241720.: 1.16440 0.335239 0.00000 1.78732 0.867209 0.382318 0.00000
1 239722.: 1.94952 5.00000e-10 0.00933767 1.65999 0.858003
0.341520 0.00908757
2 239580.: 1.95924 0.0884059 0.00933767 1.65308 0.857487
0.339296 0.00954718
3 239215.: 2.60877 0.0765848 0.0177699 1.45739 0.843562
0.275100 0.0236849
4 239204.: 2.62582 0.106670 0.0239698 1.41976 0.841086
0.261033 0.0267073
5 239176.: 2.63149 0.0787924 0.0367185 1.37952 0.838527
0.245076 0.0301134
6 239141.: 2.64949 0.108534 0.0594586 1.28846 0.832543
0.208404 0.0375456
7 239049.: 2.64794 0.0789214 0.121782 1.10436 0.819711
0.126101 0.0524965
8 239004.: 2.66044 0.117957 0.181505 0.932202 0.798982
0.0718116 0.0628958
9 238944.: 2.66310 0.0819653 0.334477 0.631735 0.740855
0.258359 0.0806590
10 238893.: 2.72626 0.0975205 0.653432 0.703912 0.666067
0.109922 0.201809
11 238892.: 2.74381 0.111146 0.666155 0.693544 0.662000 0.104060 0.207591
12 238888.: 2.75052 0.0990238 0.689199 0.694588 0.655781
0.106516 0.216460
13 238861.: 2.80325 0.126935 1.05912 0.733914 0.556162 0.159296 0.360938
14 238832.: 2.82656 0.117617 1.59471 0.607916 0.441371
0.0749944 0.976142
15 238811.: 2.87350 0.136332 1.59046 0.653141 0.353763 0.226061 1.79285
16 238810.: 2.87663 0.125135 1.58992 0.656808 0.352605 0.220488 1.79282
17 238806.: 2.89342 0.141551 1.58607 0.676523 0.344212 0.181833 1.79268
18 238804.: 2.90080 0.125137 1.56624 0.682921 0.261295 0.180598 1.74325
19 238802.: 2.91950 0.128569 1.56836 0.680436 0.336051 0.159940 1.80400
20 238801.: 2.92795 0.134762 1.56597 0.685121 0.331695 0.145547 1.80414
21 238801.: 2.93741 0.125667 1.56139 0.687827 0.332700 0.138854 1.81495
22 238800.: 2.94588 0.131757 1.55294 0.687909 0.330414 0.137834 1.82826
23 238799.: 2.96867 0.129716 1.52943 0.688678 0.323171 0.139912 1.84615
24 238799.: 2.98994 0.133378 1.52188 0.700038 0.337387 0.131403 1.77623
25 238799.: 3.00312 0.135308 1.51475 0.697550 0.311750 0.145683 1.78037
26 238799.: 3.00461 0.129920 1.51083 0.697666 0.306722 0.138745 1.81188
27 238799.: 3.00504 0.134882 1.50539 0.696745 0.302949 0.135897 1.84405
28 238799.: 3.00422 0.134013 1.47947 0.698115 0.303243 0.133806 1.86486
29 238799.: 3.00819 0.134378 1.48185 0.701871 0.307097 0.134637 1.84996
30 238799.: 3.01313 0.134279 1.49098 0.702883 0.304788 0.133682 1.86254
31 238799.: 3.01291 0.134253 1.49060 0.701818 0.303155 0.133771 1.84613
32 238799.: 3.01142 0.134314 1.48921 0.701782 0.303589 0.134439 1.84653
33 238799.: 3.01174 0.134315 1.48926 0.701641 0.304120 0.134145 1.84635
34 238799.: 3.01175 0.134304 1.48942 0.701740 0.303762 0.134185 1.84649
35 238799.: 3.01173 0.134307 1.48937 0.701724 0.303809 0.134206 1.84647
[1] 43.10 3.78 48.41 0.00 0.00
(If you run the timing yourself and don't want to see the iteration
output, take the msV=1 out of the control list. I keep it in there so
I can monitor the progress.)
If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
the same version of R, BLAS and Matrix package, the timing ends up
with something like
90 140 235 0 0
I do see that the multithreading is working on a calculation that is
essentially BLAS-bound such as
> mm <- Matrix(rnorm(1e6), nc = 1e3)
> system.time(crossprod(mm))
[1] 0.57 0.02 0.60 0.00 0.00
On the X2 processor it still takes about 0.6 seconds user time but
only 0.3 seconds elapsed time when the machine is otherwise idle and
both cores are available for the calculation.
Any suggestions why the dual-core processor is so much slower than the
single core processor?
By the way, I would be interested in accumulating timings of this
model fit on other systems. If you do time it please send me
(off-list) a summary of the version of R, version of the accelerated
BLAS if you use them, processor speed and configuration (i.e.
multiprocessor, multicore, etc.) and, if you know it, memory speed.
This is an example of a complex multilevel model with crossed grouping
factors fit to a relatively large (30000 observations on 10000
students, 1400 teachers and 80 schools) data set.
More information about the R-devel
mailing list