[R-SIG-Mac] restricting the number of threads used on a dual-core Intel machine

Simon Urbanek simon.urbanek at r-project.org
Wed Mar 8 20:06:29 CET 2006


Doug,

quick summary: AFAIK you can't without re-compiling, and even if you
do, it doesn't help.

The CRAN Intel Mac binary of R uses threaded ATLAS. According to the
documentation, ATLAS doesn't allow you to change the number of threads
it uses - it is wired in at compile time. Moreover, POSIX threads are
just a specification: as far as I can tell, OMP_NUM_THREADS only
applies to OpenMP (and compatible implementations?), not to POSIX
threads in general. OS X uses Mach threads, with pthreads just a
wrapper on top, and I didn't find any documentation that would let
you limit the number of threads at that level.
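
For what it's worth, here is a minimal sketch of setting the variable
from within R (hedged: Sys.setenv() only exists in newer R versions -
older ones use Sys.putenv() - and, as argued above, it can only matter
for BLAS builds that actually consult OMP_NUM_THREADS at run time,
such as Goto's BLAS or ACML, not for a compile-time-threaded ATLAS):

## Sketch only: set OMP_NUM_THREADS before the first BLAS call.
## This can only help with BLAS builds that read the variable at
## run time; the threaded ATLAS discussed above fixes its thread
## count at compile time and ignores it.
Sys.setenv(OMP_NUM_THREADS = "1")   # Sys.putenv() in older R
Sys.getenv("OMP_NUM_THREADS")       # check that it is set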

Now, back to your examples. The main reason for using threaded ATLAS  
as opposed to vecLib is that it is more than twice as fast (vecLib  
doesn't use threads). Here are the timings for ATLAS (multi-threaded)  
vs vecLib (single-thread):

1e6 crossprod (user / system / elapsed):
vecLib  0.78  0.02  0.80
ATLAS   0.66  0.03  0.36

As expected, threaded ATLAS is more than twice as fast as vecLib.

lmer example (user / system / elapsed):
vecLib  113.47  163.36  277.23
ATLAS    94.04  141.29  235.80

However, ATLAS used only 34 iterations instead of 37, so the runs are
not strictly comparable, but you can still say that they are not far
apart. Please note the extremely high system time in the lmer
example. I ran some profiling, and 60% of the entire runtime is spent
in malloc/free. Also, the load is only on one CPU, so both are
effectively using only one thread here.
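
(For anyone who wants to poke at this themselves, here is a rough
sketch of profiling the fit at the R level. Rprof() only attributes
time to R functions, so the malloc/free figure above came from a
native profiler, not from this. The call mirrors Doug's example
below, minus the control settings.)

## Rough sketch: profile the lmer fit at the R level.
library(Matrix)                   # lmer lived in Matrix at the time; newer R has it in lme4
data(star, package = "mlmRev")
Rprof("lmer-profile.out")         # start the R-level profiler
fm1 <- lmer(math ~ sx + eth + gr + cltype + (yrs | id) + (1 | tch) + (yrs | sch), star)
Rprof(NULL)                       # stop profiling
summaryRprof("lmer-profile.out")  # time spent per R function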

I also ran the example on a dual G5 with threaded vecLib; it clearly
cannot take advantage of multiple CPUs, and again we get an extremely
high system time (user / system / elapsed):
63.12 120.94 185.96
Again, sure enough, almost all of that system time comes from malloc.
The heaviest call tree is r_cholmod_super_numeric (Matrix) -> dsyrk_,
so that is where all the time is spent.

As a side note - vecLib on both platforms uses ATLAS internally. The
degree of optimization varies: especially on Intel Macs, an ATLAS
tuned for the Core Duo is faster than vecLib, which is probably due
to compiler choice (I was using the more recent gcc 4.0.3), tuning
for a particular CPU, and maybe the fact that vecLib is a shared
library.
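
(As an aside for readers on a much newer R than the versions discussed
here: recent versions of sessionInfo() report the BLAS/LAPACK
libraries R is linked against, which makes it easy to record which
BLAS a timing was produced with - a sketch, assuming such a version:)

## Sketch, assuming a recent R: sessionInfo() now includes the
## BLAS/LAPACK paths the running R is linked against.
si <- sessionInfo()
si$BLAS     # path of the BLAS library in use
si$LAPACK   # path of the LAPACK library in use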

Cheers,
Simon


On 8.3.2006, at 9:51, Douglas Bates wrote:

> A few days ago I wrote to r-devel about a curious timing result where
> a dual-core Athlon 64 was much slower than a single-core Athlon 64 on
> the same task in R.  I noticed that the accelerated BLAS, either
> Goto's BLAS or ACML (AMD Core Math Library), was using two
> threads on the dual-core machine.  It turns out that multithreading
> was the cause of the slow performance of R on this task.  Setting
> OMP_NUM_THREADS=1 in the environment slows down a BLAS-bound
> calculation but gives much faster performance on the R task.  The use
> of this environment variable is mentioned in an appendix of the R
> Installation and Administration manual.
>
> So my question is: how does one set this environment variable for the
> R Console on an Intel Mac, or even for R running in a terminal on an
> Intel Mac?  I tried setting OMP_NUM_THREADS=1 in the environment
> before running R in a terminal on a Mac, but that did not seem to
> have any effect.
>
> To check whether you are using multiple threads you can run
>
> mm <- matrix(rnorm(1e6), ncol = 1000)
> for (i in 1:10) print(system.time(crossprod(mm)))
>
> I do the timing multiple times because sometimes it will only use 1
> thread for the first few cases then switch to multiple threads.  If
> the elapsed time (third element of the timing result) is less than the
> user time (first element) you are using multiple threads.  For example
>
>> for (i in 1:30) print(system.time(crossprod(mm)))
> [1] 0.65 0.02 0.35 0.00 0.00
> [1] 0.65 0.03 0.36 0.00 0.00
> [1] 0.65 0.03 0.35 0.00 0.00
> [1] 0.65 0.02 0.35 0.00 0.00
> [1] 0.65 0.02 0.35 0.00 0.00
> [1] 0.66 0.03 0.35 0.00 0.00
> [1] 0.65 0.02 0.35 0.00 0.00
> [1] 0.65 0.03 0.36 0.00 0.00
> [1] 0.65 0.03 0.35 0.00 0.00
> [1] 0.66 0.02 0.35 0.00 0.00
> [1] 0.65 0.03 0.36 0.00 0.00
> [1] 0.65 0.03 0.36 0.00 0.00
> [1] 0.66 0.02 0.35 0.00 0.00
> [1] 0.66 0.03 0.35 0.00 0.00
> [1] 0.65 0.03 0.36 0.00 0.00
>
> To see that this slows down some computations install the Matrix and
> mlmRev packages and try
>
> library(Matrix)
> data(star, package = 'mlmRev')
> system.time(fm1 <- lmer(math ~ sx + eth + gr + cltype + (yrs | id) + (1 | tch) + (yrs | sch),
>                         star, control = list(nit = 0, grad = 0, msV = 1)))
>
> The iterations should converge around
>
>  37      238799.:  3.01178  0.134283  1.48933  0.701769  0.303707  0.134235  1.84660
>  38      238799.:  3.01173  0.134308  1.48939  0.701726  0.303810  0.134202  1.84648
>
> and give a timing like
>
> [1] 119.86 165.42 285.71   0.00   0.00
>
> The very large system time is indicative of problems with multiple  
> threads.
>
> I got a similar result on the dual-core Athlon 64.  After setting the
> number of threads to 1 the timing is
>
> [1] 34.74  2.48 37.22  0.00  0.00
>
> _______________________________________________
> R-SIG-Mac mailing list
> R-SIG-Mac at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>
>


