[Rd] compiling R | multi-Opteron | BLAS source
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Aug 1 19:42:48 CEST 2006
The R-devel version of R provides a pluggable BLAS, which makes such tests
fairly easy (although building the BLAS themselves is not). On dual
Opterons, using multiple threads is often not worthwhile and can be
counter-productive (Doug Bates has found some dramatic examples, and you
can see them in my timings below).
So timings for FC3, gcc 3.4.6, dual Opteron 252, 64-bit build of R. ACML
3.5.0 is by far the easiest to install (on R-devel all you need to do is
to link libacml.so to lib/libRblas.so) and pretty competitive, so that is
what I normally use.
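That linking step might look like the following (a sketch only; the ACML install prefix and the R installation path are assumptions, adjust both for your system):

```shell
# Hypothetical locations: adjust ACML_DIR and R_HOME for your installation.
ACML_DIR=/opt/acml3.5.0/gnu64      # assumed ACML install prefix
R_HOME=/usr/local/lib64/R          # assumed R installation directory

# Keep the shipped BLAS around, then point R's pluggable BLAS at ACML.
mv "$R_HOME/lib/libRblas.so" "$R_HOME/lib/libRblas.so.orig"
ln -s "$ACML_DIR/lib/libacml.so" "$R_HOME/lib/libRblas.so"
```

Restoring the original is just a matter of moving libRblas.so.orig back.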
These timings are repeatable only to within a few per cent, even after
averaging quite a few runs.
set.seed(123)
X <- matrix(rnorm(1e6), 1000)
system.time(for(i in 1:25) X%*%X)
system.time(for(i in 1:25) solve(X))
system.time(for(i in 1:10) svd(X))
internal BLAS (-O3)
> system.time(for(i in 1:25) X%*%X)
[1] 96.939 0.341 97.375 0.000 0.000
> system.time(for(i in 1:25) solve(X))
[1] 110.316 1.652 112.006 0.000 0.000
> system.time(for(i in 1:10) svd(X))
[1] 165.550 1.131 166.806 0.000 0.000
Goto 1.03, 1 thread
> system.time(for(i in 1:25) X%*%X)
[1] 12.949 0.191 13.143 0.000 0.000
> system.time(for(i in 1:25) solve(X))
[1] 23.201 1.449 24.652 0.000 0.000
> system.time(for(i in 1:10) svd(X))
[1] 43.318 1.016 44.361 0.000 0.000
Goto 1.03, dual CPU
> system.time(for(i in 1:25) X%*%X)
[1] 15.038 0.244 8.488 0.000 0.000
> system.time(for(i in 1:25) solve(X))
[1] 26.569 2.239 19.814 0.000 0.000
> system.time(for(i in 1:10) svd(X))
[1] 59.912 1.799 50.350 0.000 0.000
ACML 3.5.0 (single-threaded)
> system.time(for(i in 1:25) X%*%X)
[1] 13.794 0.368 14.164 0.000 0.000
> system.time(for(i in 1:25) solve(X))
[1] 22.990 1.695 24.710 0.000 0.000
> system.time(for(i in 1:10) svd(X))
[1] 48.267 1.373 49.662 0.000 0.000
ATLAS 3.6.0, single-threaded
> system.time(for(i in 1:25) X%*%X)
[1] 16.164 0.404 16.572 0.000 0.000
> system.time(for(i in 1:25) solve(X))
[1] 26.200 1.704 27.907 0.000 0.000
> system.time(for(i in 1:10) svd(X))
[1] 50.150 1.462 51.619 0.000 0.000
ATLAS 3.6.0, multi-threaded
> system.time(for(i in 1:25) X%*%X)
[1] 17.657 0.468 9.775 0.000 0.000
> system.time(for(i in 1:25) solve(X))
[1] 38.388 2.353 30.141 0.000 0.000
> system.time(for(i in 1:10) svd(X))
[1] 95.611 3.039 88.917 0.000 0.000
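Given that the multi-threaded runs above are often slower in total CPU time, it is worth pinning the thread count explicitly when benchmarking. A sketch (the environment-variable names are the ones the Goto BLAS and OpenMP documentation use; check your library's manual, and note they must be set before R starts):

```shell
# Limit the BLAS to one thread before launching R:
export GOTO_NUM_THREADS=1   # Goto BLAS
export OMP_NUM_THREADS=1    # OpenMP-threaded libraries, e.g. ACML's _mp build
R --vanilla < benchmark.R   # benchmark.R: hypothetical script with the timings above
```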
On Sun, 23 Jul 2006, Evan Cooch wrote:
> Greetings -
>
> A quick perusal of some of the posts to this maillist suggest the level
> of the questions is probably beyond someone working at my level, but at
> the risk of looking foolish publicly (something I find I get
> increasingly comfortable with as I get older), here goes:
>
> My research group recently purchased a multi-Opteron system (bunch of
> 880 chips), running 64-bit RHEL 4 (which we have site licensed at our
> university, so it cost us nothing - good price) with SMP support built
> into the kernel (perhaps obviously, for a multi-pro system). Several of
> our users use [R], which I've only used on a few occasions. However, it
> is part of my task to get [R] installed for folks using this system.
>
> While the simple, basic compile sequence (./configure, make, make check,
> make install) went smoothly, it's pretty clear from our benchmarks that
> the [R] code isn't running as 'rocket-fast' as it should for a system
> like this. So, I dig a bit deeper. Most of the jobs we want to run could
> benefit from BLAS support (lots of array manipulations and other bits of
> linear algebra), and a few other compilation optimizations - and here is
> where I seek advice.
>
> 1) Looks like there are 3-4 flavours: LAPACK, ATLAS, ACML
> (AMD-chips...), and Goto. In reading what I can find, it seems that
> there are reasons not to use ACML (single-thread) despite the AMD chips,
> reasons to avoid ATLAS (some hassles compiling on RHEL 4 boxes), reasons
> to avoid LAPACK (ibid), but apparently no problems with Goto BLAS.
>
> Is that a reasonable summary? At the risk of starting a larger
> discussion, I'm simply looking to get BLAS support, yielding the fastest
> [R] code with the minimum of hassles (while tweaking lines of configure
> files, weird linker sequences and all that used to appeal when I was a
> student, I don't have time at this stage). So, any quick recommendation
> for *which* BLAS library? My quick assessment suggests Goto BLAS, but
> I'm hoping for some confirmation.
>
> 3) compilation of BLAS - I can compile for 32-bit, or 64-bit.
> Presumably, given we've invested in 64-bit chips, and a 64-bit OS, we'd
> like to consider a 64-bit compilation. Which, also presumably, means
> we'd need 64-bit compilation for [R]. While I've read the short blurb on
> CRAN concerning 64-bit vs 32-bit compilations (data size vs speed), I'd
> be happy to have both on our machine. But, I'm not sure how one
> specifies 64-bits in the [R] compilation - what flags do I need to set
> during ./configure, or what config file do I need to edit?
>
> Thanks very much in advance - and, again, apologies for the 'low-level'
> of these questions, but one needs to start somewhere.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
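For the configure question above, R's documented --with-blas option lets the build link against an external BLAS; a sketch only (the library name, location, and compiler flags are examples, see the R Installation and Administration manual for the full details):

```shell
# Hypothetical 64-bit build against Goto BLAS; adjust paths and compilers.
./configure CC="gcc -m64" F77="g77 -m64" \
            --with-blas="-L/usr/local/lib64 -lgoto"   # example library location
make && make check
```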
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595