[Rd] Speeding up matrix multiplies

Mon Aug 23 22:50:07 CEST 2010

I've looked at the code for matrix multiplies in R, and found that it
can speeded up quite a bit, for multiplies that are really vector dot
products, and for other multiplies in which the result matrix is small.

Here's my test program:

u <- seq(0,1,length=1000)
v <- seq(0,2,length=1000)

A2 <- matrix(2.1,2,1000)
A5 <- matrix(2.1,5,1000)
B3 <- matrix(3.2,1000,3)

A4 <- matrix(2.1,4,1000)
B4 <- matrix(3.2,1000,4)

print(system.time ({for (i in 1:100000) w <- u %*% v}))
print(system.time ({for (i in 1:100000) w <- A5 %*% v}))
print(system.time ({for (i in 1:100000) R <- A2 %*% B3}))
print(system.time ({for (i in 1:100000) R <- A5 %*% B3}))
print(system.time ({for (i in 1:100000) R <- A4 %*% B4}))

Here are the five system.time results above using R 2.11.1:

   user  system elapsed
  1.680   0.000   1.683
   user  system elapsed
  4.350   0.000   4.349
   user  system elapsed
  4.820   0.000   4.818
   user  system elapsed
  7.370   0.020   7.383
   user  system elapsed
  7.990   0.000   7.991

and here are the results with my modified version of R 2.11.1:

   user  system elapsed
  0.270   0.000   0.271
   user  system elapsed
  1.290   0.000   1.287
   user  system elapsed
  1.080   0.000   1.086
   user  system elapsed
  3.640   0.000   3.642
   user  system elapsed
  7.950   0.000   7.953

The main issue here is again ISNAN.  The R code doesn't trust the
Fortran routine it calls to handle NA and NaN correctly, so it checks
whether the arguments have any NAs or NaNs, and if so does the matrix
multiply itself, with simple nested loops.  For multiplies in which
the number of elements in the result matrix is large, this check will
take a negligible amount of time.  But otherwise it can drastically
slow things down.  This is true in particular for dot products, of a
1xN times an Nx1 matrix, giving a 1x1 result.

I changed the code to handle dot product specially, with no need for
ISNAN checks, and otherwise to go straight to the nested loop code
(bypassing Fortran and the ISNAN checks) if the result matrix has no
more than 15 elements.  One can see that the times are much improved
above, except of course for the multiply that produces a 4x4 result,
which takes the same amount of time as before, since nothing is
different for that.  From the results, it seems that a threshold a bit
above 15 might be even better, though this will of course depend on
the architecture, compiler, etc.

Simon Urbaneck points out in reply to my message about speeding up
"sum" and "prod" that the effects of changes like this depend on the
machine architecure, compiler, etc.  This is of course true.  The
magnitude of variation in his results (not the absolute times, but the
relative times of different versions) is rather larger than one would
expect, though.  Perhaps his time measurements were affected by random
factors like system load, or perhaps they are due to non-random but
essentially arbitrary factors like the exact alignment of code
generated by the C compiler, which could change with even unrelated
modifications to the program.

The changes that I made seem generally desirable for any architecture,
or will at least make things no worse (the code expansion is trivial),
apart from the possibility of such arbitrary factors.  I'm not doing
anything like manually unrolling loops, or replacing array indexing
with pointer arithmetic, which might be seen as undesirable attempts
at out-smarting the optimizer.

The times above are for an Intel Linux machine with gcc version below:

Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.2 --program-suffix=-4.2 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-targets=all --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
Thread model: posix
gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)

I didn't change any R configuration options regarding compilation.

Below is my modified version of the matprod function in src/main/array.c.

   Radford Neal

------------------------------------------------------------------------

static void matprod(double *x, int nrx, int ncx,
		    double *y, int nry, int ncy, double *z)
{
    char *transa = "N", *transb = "N";
    int i,  j, k;
    double one = 1.0, zero = 0.0;
    LDOUBLE sum;
    Rboolean do_it_here;

    /* NOTE: ncx must be equal to nry. */

    if (nrx==1 && ncy==1) {

        /* Do dot product quickly. */

        sum = 0.0;
        for (i = 0; i<ncx; i++)
            sum += x[i] * y[i];
        z[0] = sum;

    } else if (nrx > 0 && ncx > 0 && nry > 0 && ncy > 0) {

        /* Faster to just do it here if the result is small, especially
           considering the NA/NaN check below. */

        do_it_here = nrx*ncy <= 15;

	/* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582
	 * The test is only O(n) here
	 */

        if (!do_it_here)
            for (i = 0; i < nrx*ncx; i++)
	        if (ISNAN(x[i])) { do_it_here = TRUE; break; }
	if (!do_it_here)
	    for (i = 0; i < nry*ncy; i++)
		if (ISNAN(y[i])) { do_it_here = TRUE; break; }

	if (do_it_here) {
	    for (i = 0; i < nrx; i++)
		for (k = 0; k < ncy; k++) {
		    sum = 0.0;
		    for (j = 0; j < ncx; j++)
			sum += x[i + j * nrx] * y[j + k * nry];
		    z[i + k * nrx] = sum;
		}
	} else
	    F77_CALL(dgemm)(transa, transb, &nrx, &ncy, &ncx, &one,
			    x, &nrx, y, &nry, &zero, z, &nrx);

    } else /* zero-extent operations should return zeroes */

	for(i = 0; i < nrx*ncy; i++) z[i] = 0;
}