[R] crossprod(x) vs crossprod(x,x)

Douglas Bates bates at stat.wisc.edu
Tue Nov 21 16:24:41 CET 2006


On 11/21/06, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
> On Tue, 21 Nov 2006, Duncan Murdoch wrote:
>
> > On 11/21/2006 6:39 AM, Robin Hankin wrote:
> >> Professor Ripley
> >>
> >> thanks for this.
> >>
> >> I see what you say about gc().  I'll try your looped test on my system:
> >>
> >>
> >>
> >>  > x <- matrix(rnorm(3000000),ncol=3)
> >>  >
> >>  > system.time(for(i in 1:100){crossprod(x)})
> >> [1]  7.931 20.420 28.619  0.000  0.000
> >>  > system.time(for(i in 1:100){crossprod(x)})
> >> [1]  7.975 20.590 29.512  0.000  0.000
> >>  > system.time(for(i in 1:100){crossprod(x)})
> >> [1]  8.003 20.641 30.160  0.000  0.000
> >>  > system.time(for(i in 1:100){crossprod(x,x)})
> >> [1] 2.350 0.032 2.552 0.000 0.000
> >>  > system.time(for(i in 1:100){crossprod(x,x)})
> >> [1] 2.330 0.015 2.333 0.000 0.000
> >>  > system.time(for(i in 1:100){crossprod(x,x)})
> >> [1] 2.333 0.034 2.447 0.000 0.000
> >>  >
> >>
> >>
> >>
> >> How come my findings are qualitatively different from yours?
> >
> > Have you said what version of R you're using?  I tried your code in an Intel
> > Mac with R 2.4.0, and a Windows machine with R-devel, and saw what Brian saw.
> > Maybe the Power Mac changes this, but that doesn't make a lot of sense.  On
> > the other hand, it could be that an older R release is making unnecessary
> > copies in one case but not the other.
>
> I think this indicates a problem with the BLAS on the PPC MacOS, which is
> very different from the Intel one, AFAIK.  Nothing in R itself could be
> directly responsible for that system-time usage, since R is a user
> process.

A similar problem can occur on an Intel Mac.  The underlying problem
is an interaction between the accelerated BLAS and the Mac OS X
version of malloc.  The threshold on the size of a requested memory
chunk that determines when a call to malloc triggers a system call
has not been updated for modern hardware.  Under certain circumstances
an accelerated BLAS like vecLib can end up making many calls to malloc
for memory chunks that are just over that threshold, so each call pays
for a system call.  This results in poor performance (as seen above)
on single-core machines and even worse performance on dual-core
machines.

On a Macbook Pro (dual-core Intel) with the default installation of
R-2.4.0 I get

> library(lme4)
Loading required package: Matrix
Loading required package: lattice
> data(star, package = "mlmRev")
> system.time(fm1 <- lmer(math ~ sx*eth+gr+cltype+(yrs|id)+(1|tch)+(yrs|sch),
+       star, control = list(gradient = FALSE, niterEM = 0)))
[1] 112.818 134.595 248.404   0.000   0.000

If I switch the BLAS from vecLib, which is multi-threaded, to R's
internal BLAS, which is single-threaded, I get

> system.time(fm1 <- lmer(math ~ sx*eth+gr+cltype+(yrs|id)+(1|tch)+(yrs|sch),
+    star, control = list(niterEM = 0, gradient = FALSE)))
[1] 50.489  8.311 59.172  0.000  0.000

On an Intel Mac running the binary distribution of R-2.4.0, you can
switch from the vecLib BLAS to R's internal BLAS with

$ cd `R RHOME`/lib
$ sudo ln -sf libRblas.0.dylib libRblas.dylib
Password:

I assume the process on a PPC Mac is similar.

I don't think there is anything that can be done about this until
Apple releases a new version of the operating system which, we hope,
will update the code in malloc.  At least I don't think we can change
the linkage for malloc from vecLib (Simon may be able to tell us).  So
Robin's choices are to use the vecLib BLAS and get a speed boost on
some computations but horrible performance on others, or to switch to
R's internal BLAS and get reasonable, but not great, performance all
the time.
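For readers puzzled that the two calls differ at all: crossprod(x) and
crossprod(x, x) compute the same matrix t(x) %*% x, but the one-argument
form goes through the symmetric rank-k update routine (dsyrk) while the
two-argument form goes through general matrix multiplication (dgemm), so
a BLAS can behave very differently on the two.  A quick sanity check
that the results agree:

```r
## crossprod(x) and crossprod(x, x) both compute t(x) %*% x; only the
## underlying BLAS routine differs (dsyrk vs dgemm).
set.seed(1)
x <- matrix(rnorm(30), ncol = 3)
stopifnot(isTRUE(all.equal(crossprod(x), crossprod(x, x))))
```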


