[Rd] Some timings for 64 bit Opteron (ATLAS, GOTO, std)

Martin Maechler maechler at stat.math.ethz.ch
Tue Mar 2 11:55:19 MET 2004

>>>>> "BDR" == Prof Brian Ripley <ripley at stats.ox.ac.uk>
>>>>>     on Fri, 27 Feb 2004 18:22:29 +0000 (GMT) writes:

    BDR> On 27 Feb 2004, Douglas Bates wrote:
    >> Martin Maechler <maechler at stat.math.ethz.ch> writes:
    >> > >>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
    >> > >>>>>     on 26 Feb 2004 15:44:16 +0100 writes:
    >> > 
    >> >     PD> Douglas Bates <bates at stat.wisc.edu> writes:
    >> >     >> Have you tried configuring R with Goto's BLAS
    >> >     >> http://www.cs.utexas.edu/users/kgoto/
    >> >     >> 
    >> >     >> I haven't worked with Opteron or Athlon64 computers but I understand
    >> >     >> that Goto's BLAS are very effective on those machines.  Furthermore
    >> >     >> Goto's BLAS are (only) available as .so libraries so you don't need to
    >> >     >> mess with creating the .so version.
    >> > 
    >> >     PD> I tried it, yes. Somewhat to my surprise, it seemed to be not quite as
    >> >     PD> fast as the threaded ATLAS, but I wasn't very systematic about the
    >> >     PD> benchmarking.
    >> > 
    >> >     PD> (and the Goto items have license issues, which get in the way for
    >> >     PD> binary distributions.)
    >> > 
    >> > Thanks a lot, Peter, Brian, Doug, for your feedback!
    >> > In the meantime, I have three running versions of R(-devel) on
    >> > the 64-bit Opteron
    >> > - "plain"
    >> > - linked against threaded GOTO
    >> > - linked against threaded (static) ATLAS  (using -fPIC for compilation;
    >> > 					   "large" Rlapack)
    >> > and I find that GOTO is faster than ATLAS
    >> > consistently (between ~ 5-20%) for several tests
    >> > (square matrices; %*% and solve).
    >> > ATLAS is still an order of magnitude faster than "plain" for
    >> > 3000x3000 matrices.
    >> Would you be willing to post a brief summary of comparative timings?
    >> I have thought at times that it may be worthwhile collecting
    >> comparative timings for different combinations of
    >> processor/OS/memory size and speed/
    >> on "typical" tasks in R.  As with any benchmark the results will be
    >> artificial but they can be of some help when considering what hardware
    >> to purchase.  Bioconductor users may find it particularly helpful to
    >> be able to evaluate how much they will need to pay to be able to
    >> analyze large data sets reasonably quickly.
    >> One easily-obtained timing is at the end of
    >> $RSRC/tests/Examples/base-Ex.Rout after 'make;make check'.

    BDR> That one is I think rather too artificial, as it contains few even
    BDR> moderately large examples, and is dominated by a few atypical tasks.

    BDR> I tend to use the sum of the MASS scripts as an
    BDR> informal timing: ch06.R is also a pretty good indicator.

    BDR> I think you will find that BLAS differences are pretty
    BDR> small in real-life analyses, or at least I always have.

I've now done a bit more systematic testing using more realistic
code than the large-matrix (1000^2 and 3000^2)
number crunching I did last week.
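
For reference, a minimal sketch of the kind of large-matrix timing meant
here. The function name timeSquare and its small default size are only
illustrative; this is not the script that produced last week's numbers.

```r
## Sketch: time %*% and solve() on a random n x n matrix.
## 'timeSquare' is a hypothetical helper, not part of R or of the original tests.
timeSquare <- function(n = 500) {
    A <- matrix(rnorm(n * n), n, n)
    ## keep the first three proc.time() values: user, system, elapsed
    rbind(matmul = system.time(A %*% A)[1:3],
          solve  = system.time(solve(A))[1:3])
}
tm <- timeSquare()
tm
```

Calling timeSquare(1000) or timeSquare(3000) would correspond to the
1000^2 and 3000^2 cases; with a threaded BLAS the user and elapsed
columns can differ noticeably.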

As expected, the differences disappear for VR/scripts "ch06.R"
(there's even a slight indication of GOTO being worse than no
 optimized BLAS, but probably that was a random fluctuation) and
also for the "make check" outputs.

Here is a nice R function that can be used by others as well for
getting the numbers for the "make check" (or better "make
check-all") outputs.  Note that it's interesting to also get the
times for the recommended packages.

#### After  "make check-all"  there are quite a few files with timings
####         --------------
#### Get at these

## In a Unix shell, it's as simple as
##  cd `R RHOME`/tests
##  grep '^Time elapsed' *.Rout Examples/*.Rout *.Rcheck/*.Rout

checkTimes <- function(Rhome = R.home())
{
    ## Purpose: Collect the "Time elapsed" timings of R's   "make check-all"
    ##          into a numeric  N x 3 matrix (with rownames!)
    ## ----------------------------------------------------------------------
    ## Author: Martin Maechler, Date:  1 Mar 2004, 15:27

    tDir <- file.path(Rhome,"tests")
    dirLs <- c(tDir, file.path(tDir,"Examples"),
               file.path(tDir, list.files(tDir, pattern="\\.Rcheck$")))
    iniStr <- "^Time elapsed:"
    endPat <- "\\.Rout$"
    ir <- length(rr <- list())
    for(d in dirLs) {
        files <- list.files(d, pattern = endPat)
        for(f in files) {
            lls <- readLines(file.path(d,f))
            if(length(i <- grep(iniStr, lls))) {
                tC <- textConnection(sub(iniStr,'', lls[i]))
                nCPU <- scan(tC, quiet=TRUE)
                close(tC)
                f <- sub(endPat,'', f)
                rr[[(ir <- ir+1)]] <- list(f, nCPU[1:3])
            }
        }
    }
    ## transform list to matrix
    t(matrix(sapply(rr,"[[",2), 3, length(rr),
             dimnames = list(NULL, sapply(rr,"[[",1))))
}
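
To illustrate the parsing step in isolation, here is a small sketch for a
single timing line. The exact text of the line is an assumption about what
the end of an .Rout file looks like, and scan(text = ) is used here
instead of a textConnection purely for brevity.

```r
## Assumed shape of one timing line near the end of an .Rout file
line <- "Time elapsed:  5.48 0.2 5.7 0 0"
## strip the prefix, then read the remaining numbers
nums <- scan(text = sub("^Time elapsed:", "", line), quiet = TRUE)
pt3 <- nums[1:3]   # the three values kept per file
pt3
```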


I then measured on the AMD Opteron (64-bit, dual processor, 4 GB RAM):

rM <- checkTimes()
nn <- nrow(rM)
## Look at the values --- in sorted order
iS <- sort.list(rM[,1], decreasing = TRUE)
rM[iS ,]
plot(rM[iS, 3] / rM[iS,1])
## the ratio shows no systematic pattern --> only use column 1 ("CPU[1]")
plot(rM[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
     main = paste("CPU used for checks in", file.path(R.home(), "tests")))

rM.A <- checkTimes("/usr/local/app/R/R-devel-ATLAS-inst")
rM.G <- checkTimes("/usr/local/app/R/R-devel-GOTO-inst")
rM.s <- checkTimes("/usr/local/app/R/R-devel-inst")
iS <- sort.list(rM.A[,1], decreasing = TRUE)

cbind(ATLAS = rM.A[iS,1],
      GOTO  = rM.G[iS,1],
      std   = rM.s[iS,1])
## gives
##                  ATLAS  GOTO   std
## boot-Ex          73.38 73.71 73.62
## nlme-Ex          31.92 34.18 31.91
## mgcv-Ex          29.20 31.69 29.35
## MASS-Ex          21.54 20.49 20.29
## stats-Ex         17.80 17.69 17.91
## lattice-Ex       11.38 11.37 11.05
## methods-Ex        6.87  6.53  6.58
## base-Ex           5.48  5.28  5.26
## graphics-Ex       4.71  4.73  4.70
## tools-Ex          3.86  3.66  3.82
## cluster-Ex        3.78  3.74  3.65
## utils-Ex          2.73  2.60  2.60
## p-r-random-tests  2.60  2.58  2.55
## survival-Ex       2.48  2.49  2.30
## ...
## .........

## Graphic:
plot(rM.A[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
     main= "AMD Opteron 246: CPU for R 'make check-all' tests & Examples")
iS. <- iS[1:12]
text(1:12, rM.A[iS., 1], rownames(rM.A)[iS.], adj = c(-.15, -.15), cex = 0.8)
points(1:nn+.1, rM.G[iS, 1], type = 'h', col=2)
points(1:nn+.2, rM.s[iS, 1], type = 'h', col=3)
legend(par("usr")[2], par("usr")[4], c("ATLAS", "GOTO", " std  "),
       col=1:3, lwd=1, xjust=1.1, yjust=1.1)
if(.Device == "pdf") dev.off()

### Are ATLAS or GOTO better than "standard":
matplot(1:nn, cbind(rM.A[iS,1]/ rM.s[iS,1],
                    rM.G[iS,1]/ rM.s[iS,1]), type ='p', col=1:8)
abline(h = 1, lty=3, col = "gray")
## to the contrary!  the points would have to be *below* 1 and are rather above
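
The same comparison can be made numerically on the 14 rows printed above.
This is only a sketch on the posted excerpt (the full matrices have more
rows), and the object names tm and slower are illustrative.

```r
## First-column timings as posted above (excerpt of the full table)
tm <- matrix(c(73.38, 73.71, 73.62,   31.92, 34.18, 31.91,
               29.20, 31.69, 29.35,   21.54, 20.49, 20.29,
               17.80, 17.69, 17.91,   11.38, 11.37, 11.05,
                6.87,  6.53,  6.58,    5.48,  5.28,  5.26,
                4.71,  4.73,  4.70,    3.86,  3.66,  3.82,
                3.78,  3.74,  3.65,    2.73,  2.60,  2.60,
                2.60,  2.58,  2.55,    2.48,  2.49,  2.30),
             ncol = 3, byrow = TRUE,
             dimnames = list(c("boot-Ex", "nlme-Ex", "mgcv-Ex", "MASS-Ex",
                               "stats-Ex", "lattice-Ex", "methods-Ex",
                               "base-Ex", "graphics-Ex", "tools-Ex",
                               "cluster-Ex", "utils-Ex", "p-r-random-tests",
                               "survival-Ex"),
                             c("ATLAS", "GOTO", "std")))
## ratios relative to "std": values above 1 mean slower than "std"
round(tm[, c("ATLAS", "GOTO")] / tm[, "std"], 3)
## number of tests (out of 14) where each BLAS is strictly slower than "std"
slower <- colSums(tm[, c("ATLAS", "GOTO")] > tm[, "std"])
slower
```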


The PDF graphic is available as 


However, when I run something like the following
"non-small" lm problem,


### Take a relatively large  model.matrix() --- as in ./predict-lm.R
### "R BATCH --vanilla <this>"

if(paste(R.version$major, R.version$minor, sep=".") >= 1.7)

## Here: Want usual "noisy" model; almost no printing
n <- 5000
x <- rnorm(n)
ldat <-
    data.frame(x1 = x,
               x2 = sort(5*x - rnorm(n)),
               f1 = factor(pmin(12, rpois(n, lam=  5))),
               f2 = factor(pmin(20, rpois(n, lam=  9))),
               f3 = factor(pmin(32, rpois(n, lam= 12))))
ldat$y <- with(ldat,
               10 + 4*x1 + 2*x2 + rnorm(n) +
               ## no rounding here:
               10 * rnorm(nlevels(f1))[f1] +
               100* rnorm(nlevels(f2))[f2])

mylm <- lm(y ~ .^2, data = ldat)
proc.time() ## (~= 100 sec on  P4 1.6 GHz "lynne")
str(mm <- model.matrix(mylm))
smlm <- summary(mylm)

p1 <- predict(mylm)
p2 <- predict(mylm, type = "terms")

str(myim <- influence.measures(mylm))

## R BATCH gives another "total"  proc.time() here:


things look a bit different:

Timings (the first 3 values of proc.time(): user, system, elapsed)
	-- ATLAS measured only 3x, the others 5x:

1. after lm()  # grep -n '^\[1\] [^1]' lm-tst-2.Rout-opteron-*

            user system elapsed
  ATLAS    34.56   0.56   35.57
  ATLAS    33.90   0.59   34.57
  ATLAS    34.55   0.61   35.33
  GOTO     28.17   1.82   34.68
  GOTO     29.13   1.61   35.56
  GOTO     26.90   2.05   32.99
  GOTO     28.11   1.83   34.64
  GOTO     28.26   1.92   34.90
  std      34.61   0.62   35.62
  std      33.46   0.61   34.26
  std      34.79   0.65   35.58
  std      33.78   0.67   34.62
  std      35.49   0.70   36.37

2. total for the above R script # grep -n '^\[1\] 1' lm-tst-2.Rout-opteron-*

            user system elapsed
  ATLAS   127.71   1.56  129.92
  ATLAS   130.42   1.66  132.28
  ATLAS   131.89   1.39  133.57
  GOTO    129.51  25.17  212.02
  GOTO    129.56  26.93  215.06
  GOTO    137.36  27.43  221.95
  GOTO    139.83  28.76  226.64
  GOTO    137.40  27.98  221.86
  std     159.58   1.59  161.88
  std     155.65   1.48  157.59
  std     159.01   1.67  161.21
  std     167.13   1.57  168.97
  std     166.70   1.58  168.70

Which is a bit confusing to me:
  The picture differs considerably depending on whether I "believe"
  the first number of proc.time(), say PT[1] (user CPU), or the third
  one, PT[3] (elapsed).
  Using only PT[1], as I usually have done, may be quite wrong here:
  unlike ATLAS and "std", GOTO shows a large gap between PT[3] and
  PT[1] (for the total: about 220 vs. 135 seconds), which may be due
  to the way its threading uses the two CPUs.

  Going by PT[1]:
    GOTO is about 20% faster than ATLAS (which is
    basically the same as "standard", i.e. R-internal BLAS/LAPACK)
    for the first  lm() measurement,
    but for the overall time {which adds summary.lm(), influence.lm() etc}
    GOTO and ATLAS are basically the same speed, both about 20% faster
    than "standard".

  Going by PT[3]:
    For the lm() part itself: no difference
    For the total: ATLAS >> std >> GOTO
		   ~~~~~~~~~~~~~~~~~~~~ (' >> ' := "clearly better than")
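
The averages behind these statements can be recomputed from the "total"
timings listed above. This is a sketch that assumes the 13 rows are
grouped as 3 x ATLAS, 5 x GOTO, 5 x std (the order in which grep lists
the files); the object names are illustrative.

```r
## "total" proc.time() values as posted; the row grouping is an assumption
tot <- matrix(c(127.71,  1.56, 129.92,   130.42,  1.66, 132.28,
                131.89,  1.39, 133.57,   129.51, 25.17, 212.02,
                129.56, 26.93, 215.06,   137.36, 27.43, 221.95,
                139.83, 28.76, 226.64,   137.40, 27.98, 221.86,
                159.58,  1.59, 161.88,   155.65,  1.48, 157.59,
                159.01,  1.67, 161.21,   167.13,  1.57, 168.97,
                166.70,  1.58, 168.70),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("user", "system", "elapsed")))
blas <- factor(rep(c("ATLAS", "GOTO", "std"), times = c(3, 5, 5)))

## mean of each column within each BLAS variant (3 x 3 matrix)
mTot <- apply(tot, 2, function(v) tapply(v, blas, mean))
round(mTot, 1)

## relative to "std": values below 1 mean faster than "std"
rel <- sweep(mTot, 2, mTot["std", ], "/")
round(rel, 2)
```

Comparing the "user" and "elapsed" columns of rel shows how much the
ranking depends on which of the two numbers one looks at.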


Comments welcome

Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><

