[Rd] Some timings for 64 bit Opteron (ATLAS, GOTO, std)
Martin Maechler
maechler at stat.math.ethz.ch
Tue Mar 2 11:55:19 MET 2004
>>>>> "BDR" == Prof Brian Ripley <ripley at stats.ox.ac.uk>
>>>>> on Fri, 27 Feb 2004 18:22:29 +0000 (GMT) writes:
BDR> On 27 Feb 2004, Douglas Bates wrote:
>> Martin Maechler <maechler at stat.math.ethz.ch> writes:
>>
>> > >>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>> > >>>>> on 26 Feb 2004 15:44:16 +0100 writes:
>> >
>> > PD> Douglas Bates <bates at stat.wisc.edu> writes:
>> > >> Have you tried configuring R with Goto's BLAS
>> > >> http://www.cs.utexas.edu/users/kgoto/
>> > >>
>> > >> I haven't worked with Opteron or Athlon64 computers but I understand
>> > >> that Goto's BLAS are very effective on those machines. Furthermore
>> > >> Goto's BLAS are (only) available as .so libraries so you don't need to
>> > >> mess with creating the .so version.
>> >
>> > PD> I tried it, yes. Somewhat to my surprise, it seemed to be not quite as
>> > PD> fast as the threaded ATLAS, but I wasn't very systematic about the
>> > PD> benchmarking.
>> >
>> > PD> (and the Goto items have license issues, which get in the way for
>> > PD> binary distributions.)
>> >
>> > Thanks a lot, Peter, Brian, and Doug, for your feedback!
>> > In the meantime, I have three running versions of R(-devel) on
>> > the 64-bit Opteron:
>> > - "plain"
>> > - linked against threaded GOTO
>> > - linked against threaded (static) ATLAS (using -fPIC for
>> >   compilation; "large" Rlapack)
>> > and I find that GOTO is consistently faster than ATLAS
>> > (by ~ 5-20%) for several tests
>> > (square matrices; %*% and solve).
>> > ATLAS is still an order of magnitude faster than "plain" for
>> > 3000x3000 matrices.
>>
>> Would you be willing to post a brief summary of comparative timings?
>>
>> I have thought at times that it may be worthwhile collecting
>> comparative timings for different combinations of
>> processor/OS/memory size and speed
>> on "typical" tasks in R. As with any benchmark, the results will be
>> artificial, but they can be of some help when considering what hardware
>> to purchase. Bioconductor users may find it particularly helpful to
>> be able to evaluate how much they will need to pay to be able to
>> analyze large data sets reasonably quickly.
>>
>> One easily-obtained timing is at the end of
>> $RSRC/tests/Examples/base-Ex.Rout after 'make;make check'.
BDR> That one is, I think, rather too artificial, as it contains few even
BDR> moderately large examples and is dominated by a few atypical tasks.
BDR> I tend to use the sum of the MASS scripts as an
BDR> informal timing: ch06.R is also a pretty good indicator.
BDR> I think you will find that BLAS differences are pretty
BDR> small in real-life analyses, or at least I always have.
I've now done somewhat more systematic testing, using more realistic
code than the large-matrix (1000^2 and 3000^2)
number crunching I did last week.
As expected, the differences disappear for VR/scripts "ch06.R"
(there is even a slight indication of GOTO being worse than the
unoptimized BLAS, but that was probably a random fluctuation), and
also for the "make check" outputs.
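(As an aside, here is a minimal sketch of timing such a script from
within R; that the installed MASS package ships its "scripts"
directory, and the helper's name, are assumptions of mine:)

## Hypothetical helper: time one of the MASS/VR scripts in the current R.
## Assumes system.file("scripts", package = "MASS") finds the scripts.
timeScript <- function(script = "ch06.R") {
    scr <- file.path(system.file("scripts", package = "MASS"), script)
    system.time(source(scr, echo = FALSE))
}
## e.g., run timeScript("ch06.R") in each differently-linked R build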
Here is an R function that others can use as well to collect the
numbers from the "make check" (or better, "make check-all") outputs.
Note that it is also interesting to get the times for the recommended
packages.
#### After "make check-all" there are quite a few files with timings
#### --------------
#### Get at these
## In a Unix shell, it's as simple as
##   cd `R RHOME`/tests
##   grep '^Time elapsed' *.Rout Examples/*.Rout *.Rcheck/*.Rout
checkTimes <- function(Rhome = R.home())
{
    ## Purpose: Collect the "Time elapsed" timings of R's "make check-all"
    ##          into a numeric N x 3 matrix (with rownames!)
    ## ----------------------------------------------------------------------
    ## Author: Martin Maechler, Date: 1 Mar 2004, 15:27
    tDir <- file.path(Rhome, "tests")
    dirLs <- c(tDir, file.path(tDir, "Examples"),
               file.path(tDir, list.files(tDir, pattern = "\\.Rcheck$")))
    iniStr <- "^Time elapsed:"
    endPat <- "\\.Rout$"
    ir <- length(rr <- list())
    for(d in dirLs) {
        files <- list.files(d, pattern = endPat)
        for(f in files) {
            lls <- readLines(file.path(d, f))
            if(length(i <- grep(iniStr, lls))) {
                tC <- textConnection(sub(iniStr, '', lls[i]))
                nCPU <- scan(tC, quiet = TRUE)
                close(tC)
                f <- sub(endPat, '', f)
                rr[[(ir <- ir + 1)]] <- list(f, nCPU[1:3])
            }
        }
    }
    ## transform the list into an N x 3 matrix (rownames = file names)
    t(matrix(sapply(rr, "[[", 2), 3, length(rr),
             dimnames = list(NULL, sapply(rr, "[[", 1))))
}
-----------
Now I measured on the AMD Opteron (64-bit, dual processor; 4 GB RAM):

rM <- checkTimes()
nn <- nrow(rM)
## Look at the values --- in sorted order
iS <- sort.list(rM[,1], decreasing = TRUE)
rM[iS, ]
plot(rM[iS, 3] / rM[iS, 1])
## not examined systematically --> only use "CPU[1]" (user time) below
plot(rM[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
     main = paste("CPU used for checks in", file.path(R.home(), "tests")))

rM.A <- checkTimes("/usr/local/app/R/R-devel-ATLAS-inst")
rM.G <- checkTimes("/usr/local/app/R/R-devel-GOTO-inst")
rM.s <- checkTimes("/usr/local/app/R/R-devel-inst")
iS <- sort.list(rM.A[,1], decreasing = TRUE)
cbind(ATLAS = rM.A[iS, 1],
      GOTO  = rM.G[iS, 1],
      std   = rM.s[iS, 1])
## gives:
##                     ATLAS   GOTO    std
## boot-Ex             73.38  73.71  73.62
## nlme-Ex             31.92  34.18  31.91
## mgcv-Ex             29.20  31.69  29.35
## MASS-Ex             21.54  20.49  20.29
## stats-Ex            17.80  17.69  17.91
## lattice-Ex          11.38  11.37  11.05
## methods-Ex           6.87   6.53   6.58
## base-Ex              5.48   5.28   5.26
## graphics-Ex          4.71   4.73   4.70
## tools-Ex             3.86   3.66   3.82
## cluster-Ex           3.78   3.74   3.65
## utils-Ex             2.73   2.60   2.60
## p-r-random-tests     2.60   2.58   2.55
## survival-Ex          2.48   2.49   2.30
## ...
## Graphic:
pdf("CPU-checks.pdf")
plot(rM.A[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
     main = "AMD Opteron 246: CPU for R 'make check-all' tests & Examples")
iS. <- iS[1:12]
text(1:12, rM.A[iS., 1], rownames(rM.A)[iS.], adj = c(-.15, -.15), cex = 0.8)
points(1:nn + .1, rM.G[iS, 1], type = 'h', col = 2)
points(1:nn + .2, rM.s[iS, 1], type = 'h', col = 3)
legend(par("usr")[2], par("usr")[4], c("ATLAS", "GOTO", " std "),
       col = 1:3, lwd = 1, xjust = 1.1, yjust = 1.1)
if(.Device == "pdf") dev.off()

### Are ATLAS or GOTO better than "standard"?
matplot(1:nn, cbind(rM.A[iS, 1] / rM.s[iS, 1],
                    rM.G[iS, 1] / rM.s[iS, 1]), type = 'p', col = 1:8)
abline(h = 1, lty = 3, col = "gray")
## On the contrary: the points would have to be *below* 1, and most are above.
-------------------
The PDF graphic is available as
ftp://ftp.stat.math.ethz.ch/U/maechler/R/CPU-checks.pdf
---
However, when I run something like the following
"non-small" lm() problem,
-----------------------------------------------------------------------------
### Take a relatively large model.matrix() --- as in ./predict-lm.R
### "R BATCH --vanilla <this>"

if(paste(R.version$major, R.version$minor, sep = ".") >= "1.7")
    RNGversion("1.6")
set.seed(47)

## Here: want the usual "noisy" model; almost no printing
n <- 5000
x <- rnorm(n)
ldat <-
    data.frame(x1 = x,
               x2 = sort(5*x - rnorm(n)),
               f1 = factor(pmin(12, rpois(n, lambda = 5))),
               f2 = factor(pmin(20, rpois(n, lambda = 9))),
               f3 = factor(pmin(32, rpois(n, lambda = 12))))
with(ldat,
     ldat$y <<- 10 + 4*x1 + 2*x2 + rnorm(n) +
         ## no rounding here:
         10 * rnorm(nlevels(f1))[f1] +
         100 * rnorm(nlevels(f2))[f2])
str(ldat)

mylm <- lm(y ~ .^2, data = ldat)
proc.time() ## (~= 100 sec on P4 1.6 GHz "lynne")

str(mm <- model.matrix(mylm))
smlm <- summary(mylm)
p1 <- predict(mylm)
p2 <- predict(mylm, type = "terms")
str(myim <- influence.measures(mylm))
## R BATCH gives another "total" proc.time() here:
-----------------------------------------------------------------------------
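(Runs like these can be scripted with a shell loop along these lines;
the build paths are those used above, while the script and output file
names are merely inferred from the grep commands below:)

## (shell) run the script under each of the three builds:
## for Rh in R-devel-ATLAS-inst R-devel-GOTO-inst R-devel-inst; do
##     /usr/local/app/R/$Rh/bin/R CMD BATCH --vanilla lm-tst-2.R \
##         lm-tst-2.Rout-opteron-$Rh
## done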
Things look a bit different:

Timings (the first three numbers of proc.time(), i.e. user, system,
and elapsed time); ATLAS was measured only 3 times, the others 5 times:

1. after lm()   # grep -n '^\[1\] [^1]' lm-tst-2.Rout-opteron-*
ATLAS:
34.56 0.56 35.57
33.90 0.59 34.57
34.55 0.61 35.33
GOTO:
28.17 1.82 34.68
29.13 1.61 35.56
26.90 2.05 32.99
28.11 1.83 34.64
28.26 1.92 34.90
std:
34.61 0.62 35.62
33.46 0.61 34.26
34.79 0.65 35.58
33.78 0.67 34.62
35.49 0.70 36.37
2. total for the above R script # grep -n '^\[1\] 1' lm-tst-2.Rout-opteron-*
ATLAS:
127.71 1.56 129.92
130.42 1.66 132.28
131.89 1.39 133.57
GOTO:
129.51 25.17 212.02
129.56 26.93 215.06
137.36 27.43 221.95
139.83 28.76 226.64
137.40 27.98 221.86
std:
159.58 1.59 161.88
155.65 1.48 157.59
159.01 1.67 161.21
167.13 1.57 168.97
166.70 1.58 168.70
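(To summarize these, a small sketch that averages the replicate "total"
timings; the numbers are copied verbatim from the tables above:)

## Mean of the replicate "total" timings (user, system, elapsed):
tot <- list(ATLAS = rbind(c(127.71, 1.56, 129.92), c(130.42, 1.66, 132.28),
                          c(131.89, 1.39, 133.57)),
            GOTO  = rbind(c(129.51, 25.17, 212.02), c(129.56, 26.93, 215.06),
                          c(137.36, 27.43, 221.95), c(139.83, 28.76, 226.64),
                          c(137.40, 27.98, 221.86)),
            std   = rbind(c(159.58, 1.59, 161.88), c(155.65, 1.48, 157.59),
                          c(159.01, 1.67, 161.21), c(167.13, 1.57, 168.97),
                          c(166.70, 1.58, 168.70)))
round(t(sapply(tot, colMeans)), 2)  # rows: BLAS; columns: PT[1], PT[2], PT[3]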
This is a bit confusing to me:
the picture differs considerably depending on whether I "believe" the
first number of proc.time(), say PT[1] (user CPU time), or the third
one, PT[3] (elapsed time).
Only using PT[1] --- which I usually have done --- may be quite wrong
here: contrary to ATLAS and "std", GOTO shows a large difference between
PT[3] and PT[1], which may be because of the way the threading and the
use of the two CPUs happen.
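(For illustration, a minimal sketch of how PT[1] and PT[3] can diverge
when a multithreaded BLAS spreads work over both CPUs; the matrix size
is arbitrary:)

## PT[1] is user CPU time (summed over threads), PT[3] is elapsed
## wall-clock time; under a threaded BLAS the two need not agree.
pt0 <- proc.time()
X <- matrix(rnorm(1000^2), 1000, 1000)
XtX <- crossprod(X)          # BLAS-heavy step (dsyrk / dgemm)
(proc.time() - pt0)[1:3]     # compare PT[1] (user) with PT[3] (elapsed)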
In detail:

PT[1]:
  GOTO is about 20% faster than ATLAS (which here is basically the same
  as "standard", i.e. the R-internal BLAS/LAPACK) for the first lm()
  measurement; but for the overall time {which adds summary(),
  influence.measures(), etc.} GOTO and ATLAS are basically the same
  speed, both 20% faster than "standard".

PT[3]:
  For the lm() part itself: no difference.
  For the total: ATLAS >> std >> GOTO
                 (where ' >> ' := "clearly better than")
---
Comments welcome
Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><