[R-SIG-Mac] How to Speed up R on the G5
Bill Northcott
w.northcott at unsw.edu.au
Tue Feb 8 01:21:07 CET 2005
On 08/02/2005, at 3:19 AM, Jake Bowers wrote:
> I've been receiving some friendly grief from a friend with a Linux
> dual-Opteron system about the performance of his R package on the OS X
> G5
> system.
>
> He has suggested recompiling R-patched with a variety of different
> compilers and compiler flags. And has also suggested just recompiling
> his package with different flags and compilers (while leaving
> r-patched as I have currently built it using gcc 3.3 20030304 (Apple
> Computer, Inc. build 1671), and g77 3.4.2 (from that wonderful site:
> hpc.sf.net)).
Apple have put up a good article on Performance optimisation:
http://developer.apple.com/tools/sharkoptimize.html
The moral is: 'measure first. Futz afterwards.' You really must start
by finding out where the program is spending the time. There is
absolutely no point optimising code that is rarely called. There can
be huge gains from optimising very small amounts of heavily used code.
On the G4/G5 architectures, the big gains come if you can vectorise any
of that heavily used code. If you are not using Altivec, half the CPU
is hanging around doing nothing.
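For example, R's built-in profiler will tell you where the time actually
goes before you touch any compiler flags (a minimal sketch; my_analysis()
is just a stand-in for whatever function in your package is slow):

    Rprof("profile.out")
    my_analysis()                        # hypothetical call to the slow code
    Rprof(NULL)
    summaryRprof("profile.out")$by.self  # functions ranked by own time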
>
> My second question is whether there are ways other than using
> --with-blas="-framework vecLib", to take advantage of what I thought
> was the power of the G5 (or dual G5s in my case).
Run top and see if you are using both CPUs. If not, then Rmpi or
something like that may pay big dividends.
>
> Here is what I'm playing with:
>
> 1) One set of builds with standard compilers and flags
> (--with-blas="-framework vecLib" --with-lapack)
I would take the advice from hpc.sf.net and just use the -fast flag,
but only on code that you know from profiling to be time critical.
There is a downside as others have observed below.
>
> 2) One build like (1) but using the libgoto.dylib version of BLAS and
> the vecLib stuff for lapack (It doesn't work with just
> --with-blas="-L/usr/local/lib -lgoto"
> --with-lapack).
> (http://www.cs.utexas.edu/users/kgoto/signup_first.html#For_OS_X)
IMHO from what I see on Goto's site, I doubt that libgoto will do
anything for the G5 architecture. The Power 3 data he shows indicates
little or no benefit. His optimisations seem to work well for x86 and
Alpha. However, none of this matters if you are not spending much
time in the BLAS library.
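A rough way to check is to time a large matrix product, which spends
essentially all of its time in the BLAS, and compare that with your real
workload (just a sketch; the matrix size is arbitrary):

    set.seed(1)
    a <- matrix(rnorm(1000 * 1000), 1000, 1000)
    system.time(crossprod(a))   # dominated by level-3 BLAS (DSYRK/DGEMM)

If this flies under vecLib but your own code barely changes, the BLAS is
not where your time is going.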
> but, although this compiled ok, it failed the make check on the first
> test (base-Ex.R with:
>> tx0 <- c(9, 4, 6, 5, 3, 10, 5, 3, 5)
>> x <- rep(0:8, tx0)
>> stopifnot(table(x) == tx0)
> Warning in table(x) == tx0 : longer object length
> is not a multiple of shorter object length
> Error in stopifnot(table(x) == tx0) : dim<- : dims [product 8] do not
> match the length of object [9]
> Execution halted)
>
Only optimise where it matters: as the failed check above shows, swapping
in an optimised library can cause problems of its own.
> Finally, he suggested looking into the AbSoft compilers. But, I
> figured I'd save my money and see if other folks have had luck with
> those yet.
As far as I can see the IBM (not Absoft) xlf and xlc compilers are
significantly faster, although Apple is working hard on gcc to close
the gap.
Other thoughts:
1. I don't think there is any point in wasting time on Fortran. The base
R distribution as built on a Mac uses no Fortran code, and as far as I
can see very few R packages use Fortran.
2. Someone else mentioned MCMCs. These are embarrassingly parallel
applications, and if the chains are not using both CPUs they are going
to be inefficient.
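As a sketch of running one chain per CPU (assuming the snow package is
installed and set up; run_chain() is a hypothetical wrapper around your
own sampler):

    library(snow)

    run_chain <- function(seed) {
        set.seed(seed)
        ## ... call your sampler here and return its draws ...
    }

    cl <- makeCluster(2, type = "SOCK")   # one worker per G5 CPU
    chains <- clusterApply(cl, c(101, 202), run_chain)
    stopCluster(cl)

Rmpi can be used in a similar spirit, with MPI starting the two workers.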
Finally some (so far very preliminary) experience:
I have spent a little time on JAGS, a WinBUGS (MCMC) work-alike which
uses the standalone libRmath. Running the WinBUGS kidney example, this
code spends almost all its time in the libm functions pow, exp and
log, which are called from the Weibull distribution functions in R.
AFAIK these are not vectorised. At the moment I am not comparing Mac vs PC
but WinBUGS vs JAGS. The author of JAGS thinks the sampling code is
inefficient, hence the libm functions are called too often. I am
interested in trying to replace the calls through libRmath into libm
with vectorised code, which I suspect will be much more effective on
the Mac.
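As a purely R-level illustration of the difference vectorisation makes
(this is not the JAGS code itself, just the scalar-versus-vector idea),
compare one vectorised call to dweibull() with a loop of scalar calls:

    x <- runif(1e5, 0.1, 10)
    system.time(d1 <- dweibull(x, shape = 2, scale = 3))           # one vectorised call
    system.time(d2 <- sapply(x, dweibull, shape = 2, scale = 3))   # one scalar call per element
    stopifnot(all.equal(d1, d2))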
Of course aggressively optimising the compilation of the JAGS code
makes absolutely no discernible difference to overall performance in my
example.
Bill Northcott