[Rd] Internal state indicating if a data object has NAs/no NAs/not sure (Was: Re: Speeding up matrix multiplies)

Henrik Bengtsson hb at stat.berkeley.edu
Mon Aug 23 23:33:37 CEST 2010


Hi, I've been following your messages about the overhead that the
code for dealing with possible NA/NaN values brings.  When I was
setting up parts of the matrixStats package, I also thought about
this.  I was thinking of having an additional logical argument
'hasNA'/'has.na' with which the user can specify whether there are
NA/NaN values or not, allowing the code to skip the checks when there
are none.  Of course, if hasNA=FALSE but there really are NAs, you'll
get very funny results.
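
As a sketch of what this could look like at the C level (the function
and argument names here are just for illustration, not actual
matrixStats code):

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical sum with a caller-supplied 'has_na' hint, mimicking the
 * proposed hasNA= argument: if the caller promises there are no NA/NaN
 * values, the per-element ISNAN-style test is skipped entirely.  A
 * wrong promise gives wrong results, as noted above. */
double sum_hint(const double *x, size_t n, int has_na)
{
    double s = 0.0;
    size_t i;
    if (has_na) {
        for (i = 0; i < n; i++)      /* na.rm=TRUE-style checked loop */
            if (!isnan(x[i])) s += x[i];
    } else {
        for (i = 0; i < n; i++)      /* unchecked fast loop */
            s += x[i];
    }
    return s;
}
```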

Even better would be an internal state attached to the basic data
types in R, indicating one of three things:

(i) Has NAs: we know there exist NA/NaN values,
(ii) Has No NAs: we know they do *not* exist,
(iii) Not Sure: we are uncertain, i.e. it may be either (i) or (ii).

This would allow the internal code to be optimized further, while
everything remains backward compatible if the state is ignored.  The
default state should be (iii), which is consistent with all existing
code.  For instance, y <- sum(x, na.rm=FALSE) would return NA
immediately if the state of 'x' is "Has NAs".

Different computational algorithms would then be free to set the
state whenever they know it.  For instance, matrixStats::anyMissing(x)
would flag the 'x' object as (i) when returning TRUE and as (ii) when
returning FALSE.  The next time it is called, the response could be
instant, i.e. O(1).  Similarly, y <- rowSums(x, na.rm=TRUE/FALSE)
could set the state on 'y', and so on.
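
A minimal C sketch of such a tri-state flag (types and names are
hypothetical, not actual R internals):

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical tri-state NA flag attached to a vector. */
typedef enum { NA_UNKNOWN, NA_PRESENT, NA_ABSENT } na_state;

typedef struct {
    double  *data;
    size_t   len;
    na_state nas;   /* default NA_UNKNOWN: backward-compatible "not sure" */
} vec;

/* anyMissing-style query: scans once, then answers in O(1). */
int any_missing(vec *x)
{
    if (x->nas == NA_UNKNOWN) {
        size_t i;
        x->nas = NA_ABSENT;
        for (i = 0; i < x->len; i++)
            if (isnan(x->data[i])) { x->nas = NA_PRESENT; break; }
    }
    return x->nas == NA_PRESENT;
}

/* sum(x, na.rm=FALSE): returns NaN immediately when NAs are known,
 * and records what it learned for future calls. */
double vec_sum(vec *x)
{
    double s = 0.0;
    na_state seen = NA_ABSENT;
    size_t i;
    if (x->nas == NA_PRESENT)
        return NAN;                 /* instant O(1) answer */
    for (i = 0; i < x->len; i++) {
        if (isnan(x->data[i])) seen = NA_PRESENT;
        s += x->data[i];
    }
    x->nas = seen;
    return s;
}
```

Here the first any_missing() call pays the O(n) scan, every later call
is O(1), and the sum can fail fast once NAs are known to be present.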

Just some thoughts

/Henrik


On Mon, Aug 23, 2010 at 1:50 PM, Radford Neal <radford at cs.toronto.edu> wrote:
> I've looked at the code for matrix multiplies in R and found that it
> can be sped up quite a bit for multiplies that are really vector dot
> products, and for other multiplies in which the result matrix is small.
>
> Here's my test program:
>
> u <- seq(0,1,length=1000)
> v <- seq(0,2,length=1000)
>
> A2 <- matrix(2.1,2,1000)
> A5 <- matrix(2.1,5,1000)
> B3 <- matrix(3.2,1000,3)
>
> A4 <- matrix(2.1,4,1000)
> B4 <- matrix(3.2,1000,4)
>
> print(system.time ({for (i in 1:100000) w <- u %*% v}))
> print(system.time ({for (i in 1:100000) w <- A5 %*% v}))
> print(system.time ({for (i in 1:100000) R <- A2 %*% B3}))
> print(system.time ({for (i in 1:100000) R <- A5 %*% B3}))
> print(system.time ({for (i in 1:100000) R <- A4 %*% B4}))
>
> Here are the five system.time results above using R 2.11.1:
>
>   user  system elapsed
>  1.680   0.000   1.683
>   user  system elapsed
>  4.350   0.000   4.349
>   user  system elapsed
>  4.820   0.000   4.818
>   user  system elapsed
>  7.370   0.020   7.383
>   user  system elapsed
>  7.990   0.000   7.991
>
> and here are the results with my modified version of R 2.11.1:
>
>   user  system elapsed
>  0.270   0.000   0.271
>   user  system elapsed
>  1.290   0.000   1.287
>   user  system elapsed
>  1.080   0.000   1.086
>   user  system elapsed
>  3.640   0.000   3.642
>   user  system elapsed
>  7.950   0.000   7.953
>
> The main issue here is again ISNAN.  The R code doesn't trust the
> Fortran routine it calls to handle NA and NaN correctly, so it checks
> whether the arguments have any NAs or NaNs, and if so does the matrix
> multiply itself, with simple nested loops.  For multiplies in which
> the number of elements in the result matrix is large, this check will
> take a negligible amount of time.  But otherwise it can drastically
> slow things down.  This is true in particular for a dot product, i.e.
> a 1xN matrix times an Nx1 matrix, giving a 1x1 result.
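>
> A back-of-the-envelope cost model makes this concrete (a sketch of
> the argument, not the actual R code; the function name is invented):
>
> ```c
> /* For an m-by-k times k-by-n multiply, the NA/NaN scan touches
>  * m*k + k*n elements, while dgemm performs about m*k*n multiply-adds.
>  * The scan is negligible only when the result (m by n) is large. */
> double scan_overhead(int m, int k, int n)
> {
>     double scanned = (double) m * k + (double) k * n;
>     double flops   = (double) m * k * n;
>     return scanned / flops;
> }
>
> /* scan_overhead(1, 1000, 1)       -> 2.0   (dot product: the scan
>  *                                           costs ~2x the multiply)
>  * scan_overhead(1000, 1000, 1000) -> 0.002 (scan is ~0.2% extra)  */
> ```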
>
> I changed the code to handle dot products specially, with no need for
> ISNAN checks, and otherwise to go straight to the nested loop code
> (bypassing Fortran and the ISNAN checks) if the result matrix has no
> more than 15 elements.  One can see that the times are much improved
> above, except of course for the multiply that produces a 4x4 result,
> which takes the same amount of time as before, since nothing is
> different for that.  From the results, it seems that a threshold a bit
> above 15 might be even better, though this will of course depend on
> the architecture, compiler, etc.
>
> Simon Urbanek points out, in reply to my message about speeding up
> "sum" and "prod", that the effects of changes like this depend on the
> machine architecture, compiler, etc.  This is of course true.  The
> magnitude of variation in his results (not the absolute times, but the
> relative times of different versions) is rather larger than one would
> expect, though.  Perhaps his time measurements were affected by random
> factors like system load, or perhaps they are due to non-random but
> essentially arbitrary factors like the exact alignment of code
> generated by the C compiler, which could change with even unrelated
> modifications to the program.
>
> The changes that I made seem generally desirable for any architecture,
> or will at least make things no worse (the code expansion is trivial),
> apart from the possibility of such arbitrary factors.  I'm not doing
> anything like manually unrolling loops, or replacing array indexing
> with pointer arithmetic, which might be seen as undesirable attempts
> at out-smarting the optimizer.
>
> The times above are for an Intel Linux machine with gcc version below:
>
> Using built-in specs.
> Target: i486-linux-gnu
> Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.2 --program-suffix=-4.2 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-targets=all --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
> Thread model: posix
> gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)
>
> I didn't change any R configuration options regarding compilation.
>
> Below is my modified version of the matprod function in src/main/array.c.
>
>   Radford Neal
>
> ------------------------------------------------------------------------
>
> static void matprod(double *x, int nrx, int ncx,
>                    double *y, int nry, int ncy, double *z)
> {
>    char *transa = "N", *transb = "N";
>    int i,  j, k;
>    double one = 1.0, zero = 0.0;
>    LDOUBLE sum;
>    Rboolean do_it_here;
>
>    /* NOTE: ncx must be equal to nry. */
>
>    if (nrx==1 && ncy==1) {
>
>        /* Do dot product quickly. */
>
>        sum = 0.0;
>        for (i = 0; i<ncx; i++)
>            sum += x[i] * y[i];
>        z[0] = sum;
>
>    } else if (nrx > 0 && ncx > 0 && nry > 0 && ncy > 0) {
>
>        /* Faster to just do it here if the result is small, especially
>           considering the NA/NaN check below. */
>
>        do_it_here = nrx*ncy <= 15;
>
>        /* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582
>         * The test is only O(n) here
>         */
>
>        if (!do_it_here)
>            for (i = 0; i < nrx*ncx; i++)
>                if (ISNAN(x[i])) { do_it_here = TRUE; break; }
>        if (!do_it_here)
>            for (i = 0; i < nry*ncy; i++)
>                if (ISNAN(y[i])) { do_it_here = TRUE; break; }
>
>        if (do_it_here) {
>            for (i = 0; i < nrx; i++)
>                for (k = 0; k < ncy; k++) {
>                    sum = 0.0;
>                    for (j = 0; j < ncx; j++)
>                        sum += x[i + j * nrx] * y[j + k * nry];
>                    z[i + k * nrx] = sum;
>                }
>        } else
>            F77_CALL(dgemm)(transa, transb, &nrx, &ncy, &ncx, &one,
>                            x, &nrx, y, &nry, &zero, z, &nrx);
>
>    } else /* zero-extent operations should return zeroes */
>
>        for(i = 0; i < nrx*ncy; i++) z[i] = 0;
> }
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


