[Rd] Some timings for 64 bit Opteron (ATLAS, GOTO, std)

Tue Mar 2 14:33:20 MET 2004

Hi Martin,

When I attended the LinuxWorld Expo in NYC back in January, I chatted with
some folks at the AMD booth, as well as guys from Penguin Computing (where
we bought our Opteron box).  I was told that the Opteron has a somewhat
strange setup in which the memory is controlled by one CPU.  The net effect
is that when both CPUs are running, one might only be running at around 90%
instead of 99%.  The `NUMA' kernel is supposed to fix this problem.  I wonder
if this is related to the performance of the threaded GOTO lib that you saw.
Has anyone tried the NUMA kernel?

Best,
Andy

> From: Martin Maechler
> 
> >>>>> "BDR" == Prof Brian Ripley <ripley at stats.ox.ac.uk>
> >>>>>     on Fri, 27 Feb 2004 18:22:29 +0000 (GMT) writes:
> 
>     BDR> On 27 Feb 2004, Douglas Bates wrote:
>     >> Martin Maechler <maechler at stat.math.ethz.ch> writes:
>     >> 
>     >> > >>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>     >> > >>>>>     on 26 Feb 2004 15:44:16 +0100 writes:
>     >> > 
>     >> >     PD> Douglas Bates <bates at stat.wisc.edu> writes:
>     >> >     >> Have you tried configuring R with Goto's BLAS
>     >> >     >> http://www.cs.utexas.edu/users/kgoto/
>     >> >     >> 
>     >> >     >> I haven't worked with Opteron or Athlon64 computers but I understand
>     >> >     >> that Goto's BLAS are very effective on those machines.  Furthermore
>     >> >     >> Goto's BLAS are (only) available as .so libraries so you don't need to
>     >> >     >> mess with creating the .so version.
>     >> > 
>     >> >     PD> I tried it, yes. Somewhat to my surprise, it seemed to be not quite as
>     >> >     PD> fast as the threaded ATLAS, but I wasn't very systematic about the
>     >> >     PD> benchmarking.
>     >> > 
>     >> >     PD> (and the Goto items have license issues, which get in the way for
>     >> >     PD> binary distributions.)
>     >> > 
>     >> > Thanks a lot, Peter, Brian, Doug, for your feedback!
>     >> > In the meantime, I have three running versions of R(-devel) on
>     >> > the 64-bit Opteron
>     >> > - "plain"
>     >> > - linked against threaded GOTO
>     >> > - linked against threaded (static) ATLAS  (using -fPIC for compilation;
>     >> > 					   "large" Rlapack)
>     >> > and I find that GOTO is faster than ATLAS
>     >> > consistently (between ~ 5-20%) for several tests
>     >> > (square matrices; %*% and solve).
>     >> > ATLAS is still an order of magnitude faster than "plain" for
>     >> > 3000x3000 matrices.
>     >> 
>     >> Would you be willing to post a brief summary of comparative timings?
>     >> 
>     >> I have thought at times that it may be worthwhile collecting
>     >> comparative timings for different combinations of
>     >> processor/OS/memory size and speed/
>     >> on "typical" tasks in R.  As with any benchmark the results will be
>     >> artificial but they can be of some help when considering what hardware
>     >> to purchase.  Bioconductor users may find it particularly helpful to
>     >> be able to evaluate how much they will need to pay to be able to
>     >> analyze large data sets reasonably quickly.
>     >> 
>     >> One easily-obtained timing is at the end of
>     >> $RSRC/tests/Examples/base-Ex.Rout after 'make;make check'.
> 
>     BDR> That one is I think rather too artificial, as it contains few even
>     BDR> moderately large examples, and is dominated by a few atypical tasks.
> 
>     BDR> I tend to use the sum of the MASS scripts as an
>     BDR> informal timing: ch06.R is also a pretty good indicator.
> 
>     BDR> I think you will find that BLAS differences are pretty
>     BDR> small in real-life analyses, or at least I always have.
> 
> I've now done a bit more systematic testing using more realistic
> code than the large-matrix (1000^2 and 3000^2)
> number crunching I did last week.
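
For anyone wanting to repeat the large-matrix comparison mentioned above, a
minimal sketch might look like the following.  The size n = 500 and the names
A, b, t.mult, t.solve and tm are my own choices for illustration (the thread
used 1000^2 and 3000^2 matrices); this is not the code Martin actually ran.

```r
## Minimal timing sketch; n is kept small so that it runs in seconds.
## Assumption: increase n to 1000 or 3000 to approach the sizes in the thread.
set.seed(1)
n <- 500
A <- matrix(rnorm(n * n), n, n)   # random square matrix
b <- rnorm(n)                     # right-hand side for solve()

t.mult  <- system.time(AA <- A %*% A)      # matrix product
t.solve <- system.time(x  <- solve(A, b))  # linear system A x = b

## rows: operation; columns: the first 3 entries of the timing
## (user CPU, system CPU, elapsed)
tm <- rbind(mult = t.mult[1:3], solve = t.solve[1:3])
tm
```

Running the same script under each R build (plain, ATLAS, GOTO) and comparing
the rows of tm gives the kind of comparison discussed here.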
> 
> As expected, the differences disappear for VR/scripts "ch06.R"
> (there's even a slight indication of GOTO being worse than no
>  optimized BLAS, but probably that was a random fluctuation) and
> also for the "make check" outputs.
> 
> Here is a nice R function that can be used by others as well for
> getting the numbers for the "make check" (or better "make
> check-all") outputs.  Note that it's interesting to also get the
> times for the recommended packages.
> 
> #### After  "make check-all"  there are quite some files with timings
> ####         --------------
> #### Get at these
> 
> ## In a Unix shell, it's as simple as
> ##  cd `R RHOME`/tests
> ##  grep '^Time elapsed' *.Rout Examples/*.Rout *.Rcheck/*.Rout
> 
> checkTimes <- function(Rhome = R.home())
> {
>     ## Purpose: Collect the "Time elapsed" timings of R's "make check-all"
>     ##          into a numeric  N x 3 matrix (with rownames!)
>     ## ----------------------------------------------------------------------
>     ## Author: Martin Maechler, Date:  1 Mar 2004, 15:27
> 
>     tDir <- file.path(Rhome,"tests")
>     dirLs <- c(tDir, file.path(tDir,"Examples"),
>                file.path(tDir, list.files(tDir, pattern="\\.Rcheck$")))
>     iniStr <- "^Time elapsed:"
>     endPat <- "\\.Rout$"
>     ir <- length(rr <- list())
>     for(d in dirLs) {
>         files <- list.files(d, pattern = endPat)
>         for(f in files) {
>             lls <- readLines(file.path(d,f))
>             if(length(i <- grep(iniStr, lls))) {
>                 tC <- textConnection(sub(iniStr,'', lls[i]))
>                 nCPU <- scan(tC, quiet=TRUE)
>                 close(tC)
>                 f <- sub(endPat,'', f)
>                 rr[[(ir <- ir+1)]] <- list(f, nCPU[1:3])
>             }
>         }
>     }
>     ## transform list to matrix
>     t(matrix(sapply(rr,"[[",2), 3, length(rr),
>              dimnames = list(NULL, sapply(rr,"[[",1))))
> }
> 
> -----------
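
To make the parsing step inside checkTimes() concrete, here is a small
self-contained sketch on a single line.  The line itself and its numbers are
made up for illustration, and I use scan(text = ...) as a shortcut where the
function above uses a textConnection; treat it as an assumption that your R
version supports that argument.

```r
## A made-up line in the style found at the end of an .Rout file
## (hypothetical values, not taken from any real run)
line   <- "Time elapsed:  5.48 0.21 5.93 0 0"
iniStr <- "^Time elapsed:"

## strip the prefix, then read the remaining numbers
nums <- scan(text = sub(iniStr, "", line), quiet = TRUE)

## only the first three values are kept as one row of the result matrix
nums[1:3]
```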
> 
> Here is what I measured on the AMD Opteron (64-bit, dual proc; 4 GB RAM):
> 
> rM <- checkTimes()
> nn <- nrow(rM)
> ## Look at the values --- in sorted order
> iS <- sort.list(rM[,1], decreasing = TRUE)
> rM[iS ,]
> plot(rM[iS, 3] / rM[iS,1])
> ## no systematic pattern in the ratio --> only use column 1 ("CPU[1]")
> plot(rM[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
>      main = paste("CPU used for checks in", file.path(R.home(), "tests")))
> 
> rM.A <- checkTimes("/usr/local/app/R/R-devel-ATLAS-inst")
> rM.G <- checkTimes("/usr/local/app/R/R-devel-GOTO-inst")
> rM.s <- checkTimes("/usr/local/app/R/R-devel-inst")
> iS <- sort.list(rM.A[,1], decreasing = TRUE)
> 
> cbind(ATLAS = rM.A[iS,1],
>       GOTO  = rM.G[iS,1],
>       std   = rM.s[iS,1])
> ## gives
> ##                  ATLAS  GOTO   std
> ## boot-Ex          73.38 73.71 73.62
> ## nlme-Ex          31.92 34.18 31.91
> ## mgcv-Ex          29.20 31.69 29.35
> ## MASS-Ex          21.54 20.49 20.29
> ## stats-Ex         17.80 17.69 17.91
> ## lattice-Ex       11.38 11.37 11.05
> ## methods-Ex        6.87  6.53  6.58
> ## base-Ex           5.48  5.28  5.26
> ## graphics-Ex       4.71  4.73  4.70
> ## tools-Ex          3.86  3.66  3.82
> ## cluster-Ex        3.78  3.74  3.65
> ## utils-Ex          2.73  2.60  2.60
> ## p-r-random-tests  2.60  2.58  2.55
> ## survival-Ex       2.48  2.49  2.30
> ## ...
> ## .........
> 
> ## Graphic:
> pdf("CPU-checks.pdf")
> plot(rM.A[iS, 1], type = 'h', xaxt = "n", xlab = '', ylab = "Time elapsed",
>      main= "AMD Opteron 246: CPU for R 'make check-all' tests & Examples")
> iS. <- iS[1:12]
> text(1:12, rM.A[iS., 1], rownames(rM.A)[iS.], adj = c(-.15, -.15), cex = 0.8)
> points(1:nn+.1, rM.G[iS, 1], type = 'h', col=2)
> points(1:nn+.2, rM.s[iS, 1], type = 'h', col=3)
> legend(par("usr")[2], par("usr")[4], c("ATLAS", "GOTO", " std  "),
>        col=1:3, lwd=1, xjust=1.1, yjust=1.1)
> if(.Device == "pdf") dev.off()
> 
> ### Are ATLAS or GOTO better than "standard":
> matplot(1:nn, cbind(rM.A[iS,1]/ rM.s[iS,1],
>                     rM.G[iS,1]/ rM.s[iS,1]), type ='p', col=1:8)
> abline(h = 1, lty=3, col = "gray")
> ## to the contrary!  the points would have to be *below* 1
> ## and are rather above
> 
> -------------------
> 
> The PDF graphic is available as 
>   ftp://ftp.stat.math.ethz.ch/U/maechler/R/CPU-checks.pdf
> 
> ---
> 
> When I however run something like the following
> "non-small" lm problem, 
> 
> -----------------------------------------------------------------------------
> 
> ### Take a relatively large  model.matrix() --- as in ./predict-lm.R
> ### "R BATCH --vanilla <this>"
> 
> if(getRversion() >= "1.7.0")  # compare versions properly, not as strings
>     RNGversion("1.6")
> set.seed(47)
> 
> ## Here: Want usual "noisy" model; almost no printing
> n <- 5000
> x <- rnorm(n)
> ldat <-
>     data.frame(x1 = x,
>                x2 = sort(5*x - rnorm(n)),
>                f1 = factor(pmin(12, rpois(n, lambda =  5))),
>                f2 = factor(pmin(20, rpois(n, lambda =  9))),
>                f3 = factor(pmin(32, rpois(n, lambda = 12))))
> with(ldat,
>      ldat$y <<- 10 + 4*x1 + 2*x2 + rnorm(n) +
>      ## no rounding here:
>      + 10 * rnorm(nlevels(f1))[f1] +
>      + 100* rnorm(nlevels(f2))[f2])
> str(ldat)
> 
> mylm <- lm(y ~ .^2, data = ldat)
> proc.time() ## (~= 100 sec on  P4 1.6 GHz "lynne")
> str(mm <- model.matrix(mylm))
> smlm <- summary(mylm)
> 
> p1 <- predict(mylm)
> p2 <- predict(mylm, type = "terms")
> 
> str(myim <- influence.measures(mylm))
> 
> ## R BATCH gives another "total"  proc.time() here:
> 
> -----------------------------------------------------------------------------
> 
> Things look a bit different :
> 
> Timings (the first 3 of proc.time()) 
> 	-- ATLAS measured only 3x, the others 5x :
> 
> 1. after lm()  # grep -n '^\[1\] [^1]' lm-tst-2.Rout-opteron-*
> 
> ATLAS:
>   34.56  0.56 35.57
>   33.90  0.59 34.57
>   34.55  0.61 35.33
> GOTO:
>   28.17  1.82 34.68
>   29.13  1.61 35.56
>   26.90  2.05 32.99
>   28.11  1.83 34.64
>   28.26  1.92 34.90
> std:
>   34.61  0.62 35.62
>   33.46  0.61 34.26
>   34.79  0.65 35.58
>   33.78  0.67 34.62
>   35.49  0.70 36.37
> 
> 2. total for the above R script # grep -n '^\[1\] 1' lm-tst-2.Rout-opteron-*
> 
> ATLAS:
>   127.71   1.56 129.92
>   130.42   1.66 132.28
>   131.89   1.39 133.57
> GOTO:
>   129.51  25.17 212.02
>   129.56  26.93 215.06
>   137.36  27.43 221.95
>   139.83  28.76 226.64
>   137.40  27.98 221.86
> std:
>   159.58   1.59 161.88
>   155.65   1.48 157.59
>   159.01   1.67 161.21
>   167.13   1.57 168.97
>   166.70   1.58 168.70
> 
> Which is a bit confusing to me:
>   The picture differs considerably depending on whether I "believe" the
>   first number of proc.time(), say PT[1], or the third one, PT[3].
>   
>   Only using PT[1] - which I usually have done -
>   may be quite wrong here: Contrary to ATLAS and "std",
>   GOTO shows a sizable difference between PT[3] and PT[1], which may be
>   because of the way threading and the use of the two CPUs happen: 
> 
>  PT[1]:
>     GOTO is about 20% faster than ATLAS (which is
>     basically the same as "standard", i.e. R-internal BLAS/LAPACK)
>     for the first  lm() measurement,
>     but for the overall time {which adds summary.lm(), influence.lm() etc}
>     GOTO and ATLAS are basically the same speed, both 20% faster
>     than "standard".
> 
>  PT[3]:
>     For the lm() part itself: no difference
>     For the total: ATLAS >> std >> GOTO  
> 		   ~~~~~~~~~~~~~~~~~~~~ (' >> ' := "clearly better than")
> 
> ---
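
The PT[1] versus PT[3] comparison can be checked by averaging the "total"
timings posted above.  The sketch below only retypes those numbers; the column
labels user/system/elapsed are the standard meaning of the first three entries
of proc.time(), and the object names (tot, m, gap, rel) are mine.

```r
## "Total" timings as posted above, one row per run
tot <- list(
  ATLAS = matrix(c(127.71,  1.56, 129.92,
                   130.42,  1.66, 132.28,
                   131.89,  1.39, 133.57), ncol = 3, byrow = TRUE),
  GOTO  = matrix(c(129.51, 25.17, 212.02,
                   129.56, 26.93, 215.06,
                   137.36, 27.43, 221.95,
                   139.83, 28.76, 226.64,
                   137.40, 27.98, 221.86), ncol = 3, byrow = TRUE),
  std   = matrix(c(159.58,  1.59, 161.88,
                   155.65,  1.48, 157.59,
                   159.01,  1.67, 161.21,
                   167.13,  1.57, 168.97,
                   166.70,  1.58, 168.70), ncol = 3, byrow = TRUE))

## mean per build: rows = build, columns = PT[1], PT[2], PT[3]
m <- t(sapply(tot, colMeans))
colnames(m) <- c("user", "system", "elapsed")
round(m, 1)

## elapsed time not accounted for by user + system CPU time
gap <- m[, "elapsed"] - (m[, "user"] + m[, "system"])
round(gap, 1)

## each column relative to "std" (values below 1 mean faster than std)
rel <- sweep(m, 2, m["std", ], "/")
round(rel[, c("user", "elapsed")], 2)
```

On user time ATLAS and GOTO both come out below std, while on elapsed time
GOTO ends up well above std, which is the discrepancy Martin describes.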
> 
> Comments welcome
> 
> Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
> Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
> ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
> phone: x-41-1-632-3408		fax: ...-1228			<><

______________________________________________
R-devel at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
