[Rd] Fourteen patches to speed up R
luke at stat.uiowa.edu
luke at stat.uiowa.edu
Thu Sep 23 03:19:02 CEST 2010
Thanks very much for the patches. I have spent a couple of days
working through them, and others have looked at some of them as well
and may continue to do so. Here are some notes on the individual
patches describing things I have done or decided not to do and things
others have done that I know about.
patch-transpose
Applied by Martin Maechler.
patch-for
Applied by Duncan Murdoch; revised by me. Some cosmetic
revisions, including going back to PROTECT_WITH_INDEX. Also
placed two variables 'n' and 'bgn' back under volatile
declarations. Theoretically this shouldn't be needed, but gcc
-O2 -Wclobbered warns about them, so to be safe and eliminate
the warnings they are declared volatile as well.
The current byte code compiler actually stores the binding
cell rather than using setVar or defineVar -- this eliminates
the search and does not have the destructive effect of
modifying an outer variable if the loop variable is removed,
but removing the loop variable will then reference an outer
one if available or do other strange things. It doesn't
actually make much performance difference (at least in simple
examples) -- for that we would probably need to eliminate a
number of the function calls involved at the moment. A better
solution for preserving the semantics in the case of user code
removing the loop variable might be to disallow removing the
loop variable, or to allow removal to be detected easily
(e.g. by having rm() put R_UnboundValue in the value cell).
patch-parens
Should not be applied. `(` is conceptually a BUILTIN, i.e. a
strict function that evaluates all arguments in order. Making
it a SPECIAL is conceptually wrong and confuses issues of code
analysis and compilation. It is true that calling of BUILTINs
is currently less efficient than calling of SPECIALS because
the evaluated arguments for BUILTINs are stored in freshly
allocated lists, but it would be better to work towards making
that calling mechanism more efficient for all BUILTINs than to
confuse internal semantics by converting BUILTINs to SPECIALs.
We have currently a few things that are SPECIAL even though
they really have BUILTIN semantics, but they are SPECIAL
because of issues like needing to support missing arguments,
which BUILTINs do not. We should be moving these to BUILTIN
status, though perhaps not until BUILTIN calling performance
is improved.
Whether working on BUILTIN calling performance in the
interpreter makes sense depends on where the byte code
compiler gets to. The current compiler is much more efficient
about the handling of inlined BUILTINs; the revised one in
progress is likely to me much more efficient for all BUILTINs.
I would rather not make the proposed change for braces
(do_begin) as it makes it harder to find the relevant bits to
remove if we want to change this. Source references are very
useful, but we should be able to find a way of having them
without incurring runtime overhead unless they are actually
used. I have added an R_INLINE designation to getSrcref to
encourage the compiler to do the inlining. Timing differences
for test-parens.r are in the right direction but in the noise
level on an Ubuntu laptop:
inline byte
devel decl comp
curly: 10.25 10.13 1.94
parens: 11.21 10.91 1.91
The byte comp column is for the current byte code engine and
compiler and illustrates that this approach has much more
promise.
patch-sum-prod
I had looked at this a while back and had an uncommitted
change along very similar lines. I think the reason I didn't
commit this change is that I didn't like the code expansion
that resulted, and I still don't. Looking at this again it
turns out there is a very simple code change that preserves
the code structure and achieves the same improvement in the
narm == FALSE case: reverse the test order from
if (!ISNAN(x[i]) || !narm) ...
to
if (!narm || !ISNAN(x[i])) ...
That way the expensive ISNAN test is only done when the result
might matter. This has been done for real and complex sum and
prod. It provides the same level of improvement for the narm
== FALSE case as the patch, and for the narm == TRUE cases the
differences are in the noise level on my system. This has been
committed as r52925.
The specific six timings from test-sum-prod.r on my Ubuntu
laptop are
R 2.11 devel patch switch
3.25 10.27 2.50 2.47
10.69 10.39 10.25 9.99
10.73 10.53 10.37 10.02
3.80 10.77 3.60 3.57
10.57 10.93 10.32 10.89
10.54 11.07 10.45 11.09
patch-evalList
This looks fine (standard Lisp idiom) and has been applied as
r52930. Needed to add initial values for 'tail' variables to
turn off uninitialized variable warnings (r52935). Some
timings:
R R patch byte
2.11 devel evalList comp
test-em 18.49 15.13 14.08 4.34
p1 39.52 30.80 28.72 8.80
Again a compilation approach should produce much more
substantial gains for code dominated by interpreter overhead.
Here p1 is the example
p1 <- function(x) {
for (i in seq_along(x))
x[i] <- x[i] + 1
x
}
x <- rep(1, 1e7)
system.time(p1(x))
patch-square
The analysis provided with this patch needs fleshing out. It
is useful to try to understand where the speed gains come from
and to make changes that can help other code as well.
The y == 2.0 test is fairly cheap. The speed gain of the
patch comes mostly from avoiding the overhead that comes
before the y == 2.0 test, mainly the call into R_pow and two
calls to FINITE. r52937 (r52967 for 2.12) moves the test for y
== 2.0 to the top of the R_pow function, thus avoiding the
FINITE calls; this cuts the per value cost roughly in half on
one test platform at least. r52938 (r52968 for 2.12) defines
R_POW as an inline function that handles the y == 2.0 case in
line and only calls R_pow in the general case. This cuts the
time again by roughly a third.
On some platforms further improvement comes from avoiding the
overhead in mod_iterate for cases where one argument is a
scalar or the arguments are of equal length. r52965 (r52970
for 2.12) addresses this using the same approach as for
addition, etc. from patch-vec-arith; this should be abstracted
into a macro and used consistently in a few more cases.
Special handling the scalar exponent case only speeds things
up by a few percent on my laptop and one other machine and
actually slows down on a third platform (presumably a
code/optimizer interaction), though it does help some on a
fourth platform. To keep the code simpler I prefer not to
make this change now, at least until we have had a chance to
look at abstracting the iteration process into a macro.
patch-matprod
I don't have particularly strong views on this one and will
leave it to others in R-core for now. One note: on my x86_64
Ubuntu laptop replacing ISNAN with
#undef ISNAN
static R_INLINE double ISNAN(double x) { return x != x; }
produces a fairly substantial improvement. Dropping the ISNAN
test entirely speeds things up some more, and going to an
inline version of matrix multiply helps more for the smaller
cases but not much for the larger ones in the test-matprod
examples if the inline uses LDOUBLE for accumulation. It
helps in all these cases if the inline uses double. Here are
some timings that might be useful:
R inline drop inline inline
devel ISNAN ISNAN LDOUBLE double
V-V, length 1000: 8.64 4.71 3.05 1.67 1.36
M-V, 5x1000 times 1000x1: 4.95 2.60 1.61 1.80 1.44
M-V, 50x1000 times 1000x1: 3.72 1.75 0.91 2.06 1.64
M-M, 2x1000 times 1000x3: 5.72 3.60 2.87 1.57 1.16
M-M, 5x1000 times 1000x3: 8.99 5.87 4.55 5.13 4.03
M-M, 10x1000 times 1000x10: 10.05 7.71 6.75 7.61 5.96
M-M, 10x1000 times 1000x11: 10.87 8.40 7.38 8.35 6.51
On one of our lab machines, where we are still running 2.10.1
but also have MKL BLAS versions available I got
R MKL MKL
2.10.1 seq thread
V-V, length 1000: 6.265 5.250 5.255
M-V, 5x1000 times 1000x1: 3.197 2.832 2.837
M-V, 50x1000 times 1000x1: 2.525 2.260 2.263
M-M, 2x1000 times 1000x3: 3.478 2.654 2.648
M-M, 5x1000 times 1000x3: 5.598 4.258 4.272
M-M, 10x1000 times 1000x10: 6.693 3.749 3.796
M-M, 10x1000 times 1000x11: 7.221 3.981 4.042
patch-fast-base
patch-fast-spec
I tried the patch-fast-spec patch and did not see consistent
performance improvement -- slightly faster on one example,
slightly slower on another. So it is not at all clear to me
that this provides any real benefit. The code is certainly
made more complex, so I do not think these should be applied.
Optimizing access to base functions in general and operators
in particular is one of the things a byte code compiler will
do, at least at reasonable optimization levels. The current
byte code compiler speeds up the EM example by about a factor
of three; the revised one I am working on will hopefully do
even better.
R fast byte
devel spec code
em 13.83 12.75 4.28
p1 28.30 29.91 9.04
patch-vec-arith
Looks OK. Applied to trunk as r52946 (r52969 in 2-12-branch).
Eventually it may make sense to revisit this and maybe use a
macro to abstract out common code and apply it to some other
cases as well.
patch-save-alloc
This is similar to something I have been experimenting with in
the context of byte code compilation. In that setting there is
more opportunity for optimization by looking at where the
result of a computation is being used and possibly overwriting
the target. I'm not sure this is worth doing in the
interpreter, and my timings give somewhat mixed results (that
also vary for non-obvious reasons with seemingly small code
changes). I would prefer to defer committing to this idea in
the interpreter until more is learned from the byte code
experiments and about exactly where gains, if any, might be
coming from.
patch-vec-subset
patch-dollar
I would prefer if someone in R-core who is more familiar with
the subsetting/dollar code than I am could have a look at
these.
patch-protect
As I mentioned in my initial reply, I've tried this before on
the theory that it should make a difference, but it didn't
then and I still doesn't now, at least not relative to the
noise level on my machines on the tests I ran. So I don't
think this is worth doing now, but it is worth keeping in mind
and trying again as other factors improve.
luke
On Fri, 3 Sep 2010, Radford Neal wrote:
> I've continued to work on speeding up R, and now have a collection of
> fourteen patches, some of which speed up particular functions, and
> some of which reduce general interpretive overhead. The total speed
> improvement from these patches is substantial. It varies a lot from
> one R program to the next, of course, and probably from one machine to
> the next, but speedups of 25% can be expected in many programs, and
> sometimes much more (though sometimes less as well).
>
> The fourteen patches work for revision r52822 of the development
> version of R (I haven't check against any changes in the last few
> days), and also for release 2.11.1. These patches, along with
> some documentation, are attached as speed-patches.tar.
>
> I also wrote a number of timing test programs, which are attached as
> speed-tests.tar.
>
> I've included below the documentation on what each patch does, which
> is also in "doc" in speed-patches.tar. Note that I fixed a few minor
> bugs along the way.
>
> There looks to be scope for more improvements in various parts of the
> R interpreter that I didn't get to. I'll have to put this on hold for
> now, however, to spend my time preparing for the coming teaching term.
> I'd be happy to hear of any comments on these patches, though,
> including information on how much they speed up typical programs, on
> various machines.
>
> Radford Neal
>
> -----------------------------------------------------------------------
>
> These patches to the R source for improving speed were written by
> Radford M. Neal, Sept. 2010.
>
> See the README file for how to install them.
>
> Below, I describe these patches (in alphabetical order), indicate what
> improvement they produce, and also mention any potential issues with
> using the patch, and bugs that the patches incidently fix.
>
> The timing improvements discussed below are what is obtained by
> applying each patch individually, on an Intel system running Ubuntu
> Linux with Gcc version 4.2.4. The total improvement from all patches
> is much bigger, though in a few instances a patch can diminish the
> effect of another patch, by reducing the magnitude of the
> inefficiencies that the other patch eliminates. Note though, that the
> percentage improvement for a given absolute improvement gets bigger as
> when other patches reduce overall time.
>
> For r52822, the total time for all tests in the accompanying speed
> test suite is 674 seconds. This is reduced to 487 seconds with all
> patches applied, a reduction of 28%. Particular R programs will, of
> course, see widely varying reductions depending on what operations
> they mostly do.
>
> patch-dollar
>
> Speeds up access to lists, pairlists, and environments using the
> $ operator. The speedup comes mainly from avoiding the overhead of
> calling DispatchOrEval if there are no complexities, from passing
> on the field to extract as a symbol, or a name, or both, as available,
> and then converting only as necessary, from simplifying and inlining
> the pstrmatch procedure, and from not translating string multiple
> times.
>
> Relevant timing test script: test-dollar.r
>
> This test shows about a 40% decrease in the time needed to extract
> elements of lists and environments.
>
> Changes unrelated to speed improvement:
>
> A small error-reporting bug is fixed, illustrated by the following
> output with r52822:
>
> > options(warnPartialMatchDollar=TRUE)
> > pl <- pairlist(abc=1,def=2)
> > pl$ab
> [1] 1
> Warning message:
> In pl$ab : partial match of 'ab' to ''
>
> Some code is changed at the end of R_subset3_dflt because it seems
> to be more correct, as discussed in code comments.
>
> patch-evalList
>
> Speeds up a large number of operations by avoiding allocation of
> an extra CONS cell in the procedures for evaluating argument lists.
>
> Relevant timing test scripts: all of them, but will look at test-em.r
>
> On test-em.r, the speedup from this patch is about 5%.
>
> patch-fast-base
>
> Speeds up lookup of symbols defined in the base environment, by
> flagging symbols that have a base environment definition recorded
> in the global cache. This allows the definition to be retrieved
> quickly without looking in the hash table.
>
> Relevant timing test scripts: all of them, but will look at test-em.r
>
> On test-em.r, the speedup from this patch is about 3%.
>
> Issue: This patch uses the "spare" bit for the flag. This bit is
> misnamed, since it is already used elsewhere (for closures). It is
> possible that one of the "gp" bits should be used instead. The
> "gp" bits should really be divided up for faster access, and so that
> their present use is apparent in the code.
>
> In case this use of the "spare" bit proves unwise, the patch code is
> conditional on FAST_BASE_CACHE_LOOKUP being defined at the start of
> envir.r.
>
> patch-fast-spec
>
> Speeds up lookup of function symbols that begin with a character
> other than a letter or ".", by allowing fast bypass of non-global
> environments that do not contain (and have never contained) symbols
> of this sort. Since it is expected that only functions will be
> given names of this sort, the check is done only in findFun, though
> it could also be done in findVar.
>
> Relevant timing test scripts: all of them, but will look at test-em.r
>
> On test-em.r, the speedup from this patch is about 8%.
>
> Issue: This patch uses the "spare" bit to flag environments known
> to not have symbols starting with a special character. See remarks
> on patch-fast-base.
>
> In case this use of the "spare" bit proves unwise, the patch code is
> conditional on FAST_SPEC_BYPASS being defined at the start of envir.r.
>
> patch-for
>
> Speeds up for loops by not allocating new space for the loop
> variable every iteration, unless necessary.
>
> Relevant timing test script: test-for.r
>
> This test shows a speedup of about 5%.
>
> Change unrelated to speed improvement:
>
> Fixes what I consider to be a bug, in which the loop clobbers a
> global variable, as demonstrated by the following output with r52822:
>
> > i <- 99
> > f <- function () for (i in 1:3) { print(i); if (i==2) rm(i); }
> > f()
> [1] 1
> [1] 2
> [1] 3
> > print(i)
> [1] 3
>
> patch-matprod
>
> Speeds up matrix products, including vector dot products. The
> speed issue here is that the R code checks for any NAs, and
> does the multiply in the matprod procedure (in array.c) if so,
> since BLAS isn't trusted with NAs. If this check takes about
> as long as just doing the multiply in matprod, calling a BLAS
> routine makes no sense.
>
> Relevant time test script: test-matprod.r
>
> With no external BLAS, this patch speeds up long vector-vector
> products by a factor of about six, matrix-vector products by a
> factor of about three, and some matrix-matrix products by a
> factor of about two.
>
> Issue: The matrix multiply code in matprod using an LDOUBLE
> (long double) variable to accumulate sums, for improved accuracy.
> On a SPARC system I tested on, operations on long doubles are
> vastly slower than on doubles, so that the patch produces a
> large slowdown rather than an improvement. This is also an issue
> for the "sum" function, which also uses an LDOUBLE to accumulate
> the sum. Perhaps an ordinarly double should be used in these
> places, or perhaps the configuration script should define LDOUBLE
> as double on architectures where long doubles are extraordinarily
> slow.
>
> Due to this issue, not defining MATPROD_CAN_BE_DONE_HERE at the
> start of array.c will disable this patch.
>
> patch-parens
>
> Speeds up parentheses by making "(" a special operator whose
> argument is not evaluated, thereby bypassing the overhead of
> evalList. Also slightly speeds up curly brackets by inlining
> a function that is stylistically better inline anyway.
>
> Relevant test script: test-parens.r
>
> In the parens part of test-parens.r, the speedup is about 9%.
>
> patch-protect
>
> Speeds up numerous operations by making PROTECT, UNPROTECT, etc.
> be mostly macros in the files in src/main. This takes effect
> only for files that include Defn.h after defining the symbol
> USE_FAST_PROTECT_MACROS. With these macros, code of the form
> v = PROTECT(...) must be replaced by PROTECT(v = ...).
>
> Relevant timing test scripts: all of them, but will look at test-em.r
>
> On test-em.r, the speedup from this patch is about 9%.
>
> patch-save-alloc
>
> Speeds up some binary and unary arithmetic operations by, when
> possible, using the space holding one of the operands to hold
> the result, rather than allocating new space. Though primarily
> a speed improvement, for very long vectors avoiding this allocation
> could avoid running out of space.
>
> Relevant test script: test-complex-expr.r
>
> On this test, the speedup is about 5% for scalar operands and about
> 8% for vector operands.
>
> Issues: There are some tricky issues with attributes, but I think
> I got them right. This patch relies on NAMED being set correctly
> in the rest of the code. In case it isn't, the patch can be disabled
> by not defining AVOID_ALLOC_IF_POSSIBLE at the top of arithmetic.c.
>
> patch-square
>
> Speeds up a^2 when a is a long vector by not checking for the
> special case of an exponent of 2 over and over again for every
> vector element.
>
> Relevant test script: test-square.r
>
> The time for squaring a long vector is reduced in this test by a
> factor of more than five.
>
> patch-sum-prod
>
> Speeds up the "sum" and "prod" functions by not checking for NA
> when na.rm=FALSE, and other detailed code improvements.
>
> Relevant test script: test-sum-prod.r
>
> For sum, the improvement is about a factor of 2.5 when na.rm=FALSE,
> and about 10% when na.rm=TRUE.
>
> Issue: See the discussion of patch-matprod regarding LDOUBLE.
> There is no change regarding this issue due to this patch, however.
>
> patch-transpose
>
> Speeds up the transpose operation (the "t" function) from detailed
> code improvements.
>
> Relevant test script: test-transpose.r
>
> The improvement for 200x60 matrices is about a factor of two.
> There is little or no improvement for long row or column vectors.
>
> patch-vec-arith
>
> Speeds up arithmetic on vectors of the same length, or when on
> vector is of length one. This is done with detailed code improvements.
>
> Relevant test script: test-vec-arith.r
>
> On long vectors, the +, -, and * operators are sped up by about
> 20% when operands are the same length or one operand is of length one.
>
> Rather mysteriously, when the operands are not length one or the
> same length, there is about a 20% increase in time required, though
> this may be due to some strange C optimizer peculiarity or some
> strange cache effect, since the C code for this is the same as before,
> with negligible additional overhead getting to it. Regardless, this
> case is much less common than equal lengths or length one.
>
> There is little change for the / operator, which is much slower than
> +, -, or *.
>
> patch-vec-subset
>
> Speeds up extraction of subsets of vectors or matrices (eg, v[10:20]
> or M[1:10,101:110]). This is done with detailed code improvements,
> some increased fast treatment of common cases, and some avoidance of
> unnecessary duplication.
>
> Relevant test script: test-vec-subset.r
>
> There are lots of tests in this script. The most dramatic improvement
> is for extracting many rows and columns of a large array, where the
> improvement is by about a factor of four. Extracting many rows from
> one column of a matrix is sped up by about 30%. Extracting a large
> part of a vector is sped up by about 20%. Several other operations
> have improvements of 10% or more.
>
> Changes unrelated to speed improvement:
>
> Fixes two latent bugs where the code incorrectly refers to NA_LOGICAL
> when NA_INTEGER is appropriate and where LOGICAL and INTEGER types
> are treated as interchangeable. These cause no problems at the moment,
> but would if representations were changed.
>
> Issues: The current code duplicates a vector of indexes when
> duplication seems unnecessary. As far as I can see, the only reason
> for this is so that it can remove attributes, which is helpful only
> for string subscripts, given how the routine to handle them returns
> information via an attribute. If this is the only reason, as I concluded,
> the duplication can easily be avoided, so I avoided it. But perhaps
> I don't understand something, since there are a fair number of
> interactions going on with this code. I also removed a layer of
> procedure call overhead that seemed to be doing nothing. Probably
> it used to do something, but no longer does, but if instead it is
> preparation for some future use, then removing it would be a mistake.
>
--
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa Phone: 319-335-3386
Department of Statistics and Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall email: luke at stat.uiowa.edu
Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
More information about the R-devel
mailing list