[Rd] Fourteen patches to speed up R

Fri Sep 3 19:27:19 CEST 2010

I've continued to work on speeding up R, and now have a collection of
fourteen patches, some of which speed up particular functions, and
some of which reduce general interpretive overhead.  The total speed
improvement from these patches is substantial.  It varies a lot from
one R program to the next, of course, and probably from one machine to
the next, but speedups of 25% can be expected in many programs, and
sometimes much more (though sometimes less as well).

The fourteen patches work for revision r52822 of the development
version of R (I haven't check against any changes in the last few
days), and also for release 2.11.1.  These patches, along with
some documentation, are attached as speed-patches.tar.

I also wrote a number of timing test programs, which are attached as
speed-tests.tar.

I've included below the documentation on what each patch does, which
is also in "doc" in speed-patches.tar.  Note that I fixed a few minor
bugs along the way.

There looks to be scope for more improvements in various parts of the
R interpreter that I didn't get to.  I'll have to put this on hold for
now, however, to spend my time preparing for the coming teaching term.
I'd be happy to hear of any comments on these patches, though,
including information on how much they speed up typical programs, on
various machines.

   Radford Neal

-----------------------------------------------------------------------

These patches to the R source for improving speed were written by
Radford M. Neal, Sept. 2010.

See the README file for how to install them. 

Below, I describe these patches (in alphabetical order), indicate what
improvement they produce, and also mention any potential issues with
using the patch, and bugs that the patches incidently fix.

The timing improvements discussed below are what is obtained by
applying each patch individually, on an Intel system running Ubuntu
Linux with Gcc version 4.2.4.  The total improvement from all patches
is much bigger, though in a few instances a patch can diminish the
effect of another patch, by reducing the magnitude of the
inefficiencies that the other patch eliminates.  Note though, that the
percentage improvement for a given absolute improvement gets bigger as
when other patches reduce overall time.

For r52822, the total time for all tests in the accompanying speed
test suite is 674 seconds.  This is reduced to 487 seconds with all
patches applied, a reduction of 28%.  Particular R programs will, of
course, see widely varying reductions depending on what operations
they mostly do.

patch-dollar

    Speeds up access to lists, pairlists, and environments using the
    $ operator.  The speedup comes mainly from avoiding the overhead of 
    calling DispatchOrEval if there are no complexities, from passing
    on the field to extract as a symbol, or a name, or both, as available,
    and then converting only as necessary, from simplifying and inlining
    the pstrmatch procedure, and from not translating string multiple
    times.  

    Relevant timing test script:  test-dollar.r 

    This test shows about a 40% decrease in the time needed to extract
    elements of lists and environments.

    Changes unrelated to speed improvement:

    A small error-reporting bug is fixed, illustrated by the following
    output with r52822:

    > options(warnPartialMatchDollar=TRUE)
    > pl <- pairlist(abc=1,def=2)
    > pl$ab
    [1] 1
    Warning message:
    In pl$ab : partial match of 'ab' to ''

    Some code is changed at the end of R_subset3_dflt because it seems 
    to be more correct, as discussed in code comments. 

patch-evalList

    Speeds up a large number of operations by avoiding allocation of
    an extra CONS cell in the procedures for evaluating argument lists.

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 5%.

patch-fast-base

    Speeds up lookup of symbols defined in the base environment, by
    flagging symbols that have a base environment definition recorded
    in the global cache.  This allows the definition to be retrieved
    quickly without looking in the hash table.  

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 3%.

    Issue:  This patch uses the "spare" bit for the flag.  This bit is
    misnamed, since it is already used elsewhere (for closures).  It is
    possible that one of the "gp" bits should be used instead.  The
    "gp" bits should really be divided up for faster access, and so that
    their present use is apparent in the code.

    In case this use of the "spare" bit proves unwise, the patch code is 
    conditional on FAST_BASE_CACHE_LOOKUP being defined at the start of
    envir.r.

patch-fast-spec

    Speeds up lookup of function symbols that begin with a character
    other than a letter or ".", by allowing fast bypass of non-global
    environments that do not contain (and have never contained) symbols 
    of this sort.  Since it is expected that only functions will be
    given names of this sort, the check is done only in findFun, though
    it could also be done in findVar.

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 8%.    

    Issue:  This patch uses the "spare" bit to flag environments known
    to not have symbols starting with a special character.  See remarks
    on patch-fast-base.

    In case this use of the "spare" bit proves unwise, the patch code is 
    conditional on FAST_SPEC_BYPASS being defined at the start of envir.r.

patch-for

    Speeds up for loops by not allocating new space for the loop
    variable every iteration, unless necessary.  

    Relevant timing test script:  test-for.r

    This test shows a speedup of about 5%.  

    Change unrelated to speed improvement:

    Fixes what I consider to be a bug, in which the loop clobbers a
    global variable, as demonstrated by the following output with r52822:

    > i <- 99
    > f <- function () for (i in 1:3) { print(i); if (i==2) rm(i); }
    > f()
    [1] 1
    [1] 2
    [1] 3
    > print(i)
    [1] 3

patch-matprod

    Speeds up matrix products, including vector dot products.  The
    speed issue here is that the R code checks for any NAs, and 
    does the multiply in the matprod procedure (in array.c) if so,
    since BLAS isn't trusted with NAs.  If this check takes about
    as long as just doing the multiply in matprod, calling a BLAS
    routine makes no sense.  

    Relevant time test script:  test-matprod.r

    With no external BLAS, this patch speeds up long vector-vector 
    products by a factor of about six, matrix-vector products by a
    factor of about three, and some matrix-matrix products by a 
    factor of about two.

    Issue:  The matrix multiply code in matprod using an LDOUBLE
    (long double) variable to accumulate sums, for improved accuracy.  
    On a SPARC system I tested on, operations on long doubles are 
    vastly slower than on doubles, so that the patch produces a 
    large slowdown rather than an improvement.  This is also an issue 
    for the "sum" function, which also uses an LDOUBLE to accumulate
    the sum.  Perhaps an ordinarly double should be used in these
    places, or perhaps the configuration script should define LDOUBLE 
    as double on architectures where long doubles are extraordinarily 
    slow.

    Due to this issue, not defining MATPROD_CAN_BE_DONE_HERE at the
    start of array.c will disable this patch.

patch-parens

    Speeds up parentheses by making "(" a special operator whose
    argument is not evaluated, thereby bypassing the overhead of
    evalList.  Also slightly speeds up curly brackets by inlining
    a function that is stylistically better inline anyway.

    Relevant test script:  test-parens.r

    In the parens part of test-parens.r, the speedup is about 9%.

patch-protect

    Speeds up numerous operations by making PROTECT, UNPROTECT, etc.
    be mostly macros in the files in src/main.  This takes effect
    only for files that include Defn.h after defining the symbol
    USE_FAST_PROTECT_MACROS.  With these macros, code of the form
    v = PROTECT(...) must be replaced by PROTECT(v = ...).  

    Relevant timing test scripts:  all of them, but will look at test-em.r 

    On test-em.r, the speedup from this patch is about 9%.

patch-save-alloc

    Speeds up some binary and unary arithmetic operations by, when
    possible, using the space holding one of the operands to hold
    the result, rather than allocating new space.  Though primarily
    a speed improvement, for very long vectors avoiding this allocation 
    could avoid running out of space.

    Relevant test script:  test-complex-expr.r

    On this test, the speedup is about 5% for scalar operands and about
    8% for vector operands.

    Issues:  There are some tricky issues with attributes, but I think
    I got them right.  This patch relies on NAMED being set correctly 
    in the rest of the code.  In case it isn't, the patch can be disabled 
    by not defining AVOID_ALLOC_IF_POSSIBLE at the top of arithmetic.c.

patch-square

    Speeds up a^2 when a is a long vector by not checking for the
    special case of an exponent of 2 over and over again for every 
    vector element.

    Relevant test script:  test-square.r

    The time for squaring a long vector is reduced in this test by a
    factor of more than five.

patch-sum-prod

    Speeds up the "sum" and "prod" functions by not checking for NA
    when na.rm=FALSE, and other detailed code improvements.

    Relevant test script:  test-sum-prod.r

    For sum, the improvement is about a factor of 2.5 when na.rm=FALSE,
    and about 10% when na.rm=TRUE.

    Issue:  See the discussion of patch-matprod regarding LDOUBLE.
    There is no change regarding this issue due to this patch, however.

patch-transpose

    Speeds up the transpose operation (the "t" function) from detailed
    code improvements.

    Relevant test script:  test-transpose.r

    The improvement for 200x60 matrices is about a factor of two.
    There is little or no improvement for long row or column vectors.

patch-vec-arith

    Speeds up arithmetic on vectors of the same length, or when on
    vector is of length one.  This is done with detailed code improvements.

    Relevant test script:  test-vec-arith.r

    On long vectors, the +, -, and * operators are sped up by about     
    20% when operands are the same length or one operand is of length one.

    Rather mysteriously, when the operands are not length one or the
    same length, there is about a 20% increase in time required, though
    this may be due to some strange C optimizer peculiarity or some 
    strange cache effect, since the C code for this is the same as before,
    with negligible additional overhead getting to it.  Regardless, this 
    case is much less common than equal lengths or length one.

    There is little change for the / operator, which is much slower than
    +, -, or *.

patch-vec-subset

    Speeds up extraction of subsets of vectors or matrices (eg, v[10:20]
    or M[1:10,101:110]).  This is done with detailed code improvements,
    some increased fast treatment of common cases, and some avoidance of 
    unnecessary duplication.

    Relevant test script:  test-vec-subset.r

    There are lots of tests in this script.  The most dramatic improvement
    is for extracting many rows and columns of a large array, where the 
    improvement is by about a factor of four.  Extracting many rows from
    one column of a matrix is sped up by about 30%.  Extracting a large
    part of a vector is sped up by about 20%.  Several other operations 
    have improvements of 10% or more.

    Changes unrelated to speed improvement:

    Fixes two latent bugs where the code incorrectly refers to NA_LOGICAL
    when NA_INTEGER is appropriate and where LOGICAL and INTEGER types
    are treated as interchangeable.  These cause no problems at the moment,
    but would if representations were changed.

    Issues:  The current code duplicates a vector of indexes when 
    duplication seems unnecessary.  As far as I can see, the only reason
    for this is so that it can remove attributes, which is helpful only
    for string subscripts, given how the routine to handle them returns 
    information via an attribute.  If this is the only reason, as I concluded, 
    the duplication can easily be avoided, so I avoided it.  But perhaps 
    I don't understand something, since there are a fair number of 
    interactions going on with this code.  I also removed a layer of
    procedure call overhead that seemed to be doing nothing.  Probably
    it used to do something, but no longer does, but if instead it is
    preparation for some future use, then removing it would be a mistake.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: speed-patches.tar
Type: application/x-tar
Size: 133120 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20100903/f09f134f/attachment.tar>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: speed-tests.tar
Type: application/x-tar
Size: 61440 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20100903/f09f134f/attachment-0001.tar>