---
title: Performance
author: Hadley Wickham --- very partly modified by Martin Maechler
layout: default
---
# Performance
```{r, echo = FALSE}
library(microbenchmark)
options(digits = 3)
options("microbenchmark.unit" = "ns")
print_microbenchmark <- function (x, unit, order = NULL, ...) {
s <- summary(x, unit = unit)
cat("Unit: ", attr(s, "unit"), "\n", sep = "")
timing_cols <- c("min", "lq", "median", "uq", "max")
s[timing_cols] <- lapply(s[timing_cols], signif, digits = 3)
s[timing_cols] <- lapply(s[timing_cols], format, big.mark = ",")
print(s, ..., row.names = FALSE)
}
assignInNamespace("print.microbenchmark", print_microbenchmark,
"microbenchmark")
```
## Introduction
R is not a fast computer language. This is not an accident: R has been thoughtfully designed to make it easier for you to solve data analysis and statistics challenges, not to make your computer's life easier. While R is slow compared to other programming languages, it certainly can be fast enough for most purposes. The goal of this part of the book is to give you a deeper understanding of R's performance characteristics. You'll learn where R is fast, where it's slow, and how you can improve R's performance if it becomes a problem.
This chapter explores some of the reasons why R is slow, breaking them down into causes related to the design of the language and causes related to the current implementation. While the language can't be changed without breaking code, it's possible to change the implementation, and I'll conclude by discussing some of the exciting new implementations of R which promise considerable performance improvements. Throughout the chapter I'll use microbenchmarking to quantitatively explore R's performance characteristics.
The following four chapters will give you the skills to improve slow code:
* In [Profiling](#profiling), you'll learn how to make your code faster
by first systematically figuring what is slow and then applying some
general techniques that are often useful for improving performance.
* In [Memory](#memory), you'll learn about how R uses memory, and the
impact of garbage collection and modification in place on R's performance
and memory usage.
* For really high-performance code, you can move outside of R and use
another programming language. [Rcpp](#rcpp) will teach you the absolute
minimum you need to know about C++ in order to write fast code using the
Rcpp package.
* Finally, if you want to deeply understand the performance of built in
base functions, you'll need to learn a little bit about R's C api. In
[R's C interface](#c-api), you'll learn more about R's C internals
and see how some common built-in functions work.
Let's get started by learning more about why R is slow.
## Why is R slow?
To understand R's performance it helps to think about R in two ways: as a language and an implementation. The R-language is abstract. It defines what R code means and how it should work. An implementation is concrete: you give it R code and it computes the result. The most popular implementation of R is the one you download from [r-project.org](http://r-project.org). I'll call it GNU-R to distinguish it from the R-language, and from the other implementations we'll talk about later in the chapter.
The distinction between language and implementation is a bit murky for R because the R-language is not formally defined. R has the [R language definition](http://cran.r-project.org/doc/manuals/R-lang.html), but it is informal and incomplete: the R-language is mostly defined in terms of how GNU-R works. Other languages, like [C++](http://isocpp.org/std/the-standard) and [javascript](http://www.ecma-international.org/publications/standards/Ecma-262.htm) make the distinction more clear. They have formal specifications that describe in minute detail how every aspect of the language should work. Even though the distinction between R-language and GNU-R isn't clear cut, it's still useful because poor performance due to the implementation can be fixed relatively easily, while poor performance related to the language is hard to fix without changing what R code means.
[Language performance](#language-performance) discusses some of the ways in which the design of the R language imposes fundamental constraints on R's speed. While language design constrains the maximum possible performance, GNU-R is currently far from the optimum. [Implementation performance](#implementation-performance) discusses why and why performance improvements happen so slowly. It's hard to know exactly how much faster a better implementation could be, but a >10x improvement in speed seems achievable. In [alternative implementations](#faster-r), I'll discuss some of the promising new implementations, and describe one important technique they use to make R code faster.
As well as performance limitations imposed by the language design and current implementation, a lot of R code is slow because it's poorly written. Few R users have any formal training in programming or software development, and fewer still write R code for a living. Most people coding in R use it to better understand their data: it's more important to get an answer quickly than to develop a system that will work for a wide variety of inputs. This sounds like a huge negative for R, but it's really a huge positive because it shows that R is useful for non-programmers. It also means that it's relatively easy to make most R code much faster, as described in the following chapters.
Before we continue on to explore some of the slow parts of the R-language and GNU-R, we need to learn a little about benchmarking, so that we can make our intuition about performance concrete.
## Microbenchmarking
A microbenchmark is a performance measurement of a very small piece of code, something that might take microseconds (µs) or nanoseconds (ns) to run. I'm going to use microbenchmarks to demonstrate the performance of very low-level pieces of R, to help develop your intuition for how R works. This intuition, by-and-large, is not useful for increasing the speed of real code. In the same way that understanding physics doesn't help you understand practical chemistry, the small times in microbenchmarks are typically dominated by higher-order effects in real code. Don't change the way you code because of these microbenchmarks; instead wait until the next chapter to see how to improve the performance of real code.
The best tool for microbenchmarking in R is the [microbenchmark](http://cran.r-project.org/web/packages/microbenchmark/) package. It provides very precise timings and makes it possible to compare operations that only take a tiny amount of time. For example, the following code compares the speed of two ways of computing a square root.
```{r}
library(microbenchmark)
microbenchmark(x, x^2) # nano seconds
x <- runif(100)
microbenchmark(
sqrt(x),
x ^ 0.5
)
## to do system.time() ?
Ex <- expression(id=x, sq=x^2, x.x= x*x, rt=sqrt(x), p0.5 = x ^ 0.5)
lapply(Ex, system.time) ## all 0
lapply(Ex, function(expr) system.time(for(i in 1:10000) r <- eval(expr)))
system.time(for(i in 1:100000) r <- x)
sapply(as.list(Ex), class)
mb <- do.call(microbenchmark, as.list(Ex))
mb <- do.call(microbenchmark,
c(as.list(Ex), times=200))
plot(mb)
plot(mb, log="y", notch=TRUE)
```
In this example, you can see that using the special purpose `sqrt()` function is faster than the general exponentiation operator. As with all microbenchmarks, pay careful attention to the units: each computation takes about 800 ns, 800 billionths of a second. To help calibrate the impact of a microbenchmark on run time, it's useful to think about how many times a function needs to run before it takes a second. If a microbenchmark takes:
* 1 ms, then one thousand calls takes a second
* 1 µs, then one million calls takes a second
* 1 ns, then one billion calls takes a second
The `sqrt()` function takes about 800 ns, or 0.8 µs, to compute the square root of 100 numbers. That means if you repeated the operation a million times, it would take 0.8s. This is unlikely to be a bottleneck in real code.
`microbenchmark()` takes multiple expressions as input, and displays summaries of the distribution of times. By default, `microbenchmark()` runs each expression 100 times (controlled by the `times` parameter), randomising the order of the expressions. It summarises the results with a minimum (`min`), lower quartile (`lq`), median, upper quartile (`uq`) and maximum (`max`). You normally want to focus on the median, but the upper and lower quartiles (`lq` and `uq`) are also useful to get a feel for the variability. You can see the complete distribution with the `boxplot()` method, or if ggplot2 is loaded, with `autoplot()`.
### Exercises
* Instead of using `microbenchmark()`, you could use the built-in function
`system.time()`. But `system.time()` is much less precise, so you'll
need to repeat each operation many times with a loop, and then divide
to find the average time of each operation, as in the code below.
How do the estimates from `system.time()` compare to those from
`microbenchmark()`? Why are they different?
```{r, eval = FALSE}
n <- 1:1e6
system.time(for (i in n) sqrt(x)) / length(n)
system.time(for (i in n) x ^ 0.5) / length(n)
```
* Here are two other ways to compute the square root of a vector. Which
do you think will be fastest? Which will be slowest? Use microbenchmarking
to confirm your answers.
```{r, eval = FALSE}
x ^ (1 / 2)
exp(log(x) / 2)
```
* Use microbenchmarking to rank the basic arithmetic operators (`+`, `-`,
`*`, `/`, and `^`) in terms of speed. Visualise the results. Compare
the speed of operating on integers vs. doubles.
* You can change the units in which the microbenchmark results are
expressed with the `unit` parameter. Use `unit = "eps"` to show
the number of evaluations needed to take 1 second. Repeat the benchmarks
above with the eps unit. How does it change your intuition for performance?
## Language performance
In this section I'll explore three trade-offs that limit the performance of the R-language: extreme dynamism, name lookup with mutable environments, and lazy evaluation of function arguments. I'll illustrate each trade-off a microbenchmark, showing how it slows GNU-R down. You can't benchmark a language, since it's an abstract construct, so the benchmarks are only suggestive of the cost of these decisions to the language, but are nevertheless useful. Designing a useful language is a delicate balancing act. There are many options and many tradeoffs, and you need to balance between speed, flexibility and ease of implementation. I hope these three examples will give you a sense of this balance.
If you'd like to learn more about the performance characteristics of the R-language and how they affect real code, I highly recommend [Evaluating the Design of the R Language](https://www.cs.purdue.edu/homes/jv/pubs/ecoop12.pdf) by Floreal Morandat, Brandon Hill, Leo Osvald and Jan Vitek. It discusses a powerful methodology for understanding the performance characteristics of GNU-R using a modified R interpreter and a wide set of code found in the wild.
### Extreme dynamism
R is an extremely dynamic programming language, and almost anything can be modified after it is created. To give just a few examples, you can:
* change the body, arguments and environment of functions
* change the S4 methods for a generic
* add new fields to an S3 object, or even change its class
* modify objects outside of the local environment with `<<-`
Pretty much the only thing you can't change are objects in sealed namespaces. After a package is loaded its namespace is sealed and it is harder (although still not impossible) to change objects defined by the package.
The advantage of dynamism is that you don't need to do upfront planning, and you don't need an initial compilation step. You can change your mind at any point, iterating you way to a solution without having to start afresh.
The disadvantage of dynamism is that it makes it difficult to predict exactly what will happen for a given function call. The easier it is to predict what's going to happen, the more likely an interpreter or compiler can jump directly to the correct method. (If you'd like more details, Charles Nutter expands on this idea at [On Languages, VMs, Optimization, and the Way of the World](http://blog.headius.com/2013/05/on-languages-vms-optimization-and-way.html).) If an interpreter can't predict what's going to happen, it has to look through many options to find the right one, a slow operation. For example, code like the following loop is slow in R, because R can't know that the type of `x` never changes. That means R has to look for the right `+` method (e.g. is it adding doubles, or integers) in every iteration of the loop.
```{r, eval = FALSE}
x <- 0L
for (i in 1:1e6) {
x <- x + 1
}
```
The cost of finding the right method is higher for non-primitive functions. The following microbenchmark illustrates the cost of method dispatch for S3, S4, and RC. I create a generic and a method for each OO system, then call the generic and see how long it takes to find and call the method. I also time how long it takes to call the bare method for comparison.
```{r, results = 'hide'}
f <- function(x) NULL
s3 <- function(x) UseMethod("s3")
s3.integer <- f
A <- setClass("A", representation(a = "list"))
setGeneric("s4", function(x) standardGeneric("s4"))
setMethod(s4, "A", f)
B <- setRefClass("B", methods = list(rc = f))
a <- A()
b <- B$new()
```
```{r}
microbenchmark(
fun = f(),
S3 = s3(1L),
S4 = s4(a),
RC = b$rc()
)
```
On my computer, the function call takes about 280 ns. S3 method dispatch takes an additional 3,000 ns; S4 dispatch, 13,000 ns; and RC dispatch, 11,000 ns. S3 and S4 method dispatch is so expensive because R must search for the right method every time the generic is called; it might have changed between this call and the last. R could do better by caching methods between calls, but caching is hard to do correctly and a notorious source of bugs.
### Name lookup with mutable environments
It's surprisingly difficult to find the value associated with a name in the R-language because of the combination of lexical scoping and extreme dynamism. Take the following example. Each time we print `a` it comes from a different environment:
```{r}
a <- 1
f <- function() {
g <- function() {
print(a)
assign("a", 2, envir = parent.frame())
print(a)
a <- 3
print(a)
}
g()
}
f()
```
This means that you can't rely on the binding associated with a name being in the same place as it was last time you looked: you have to start from scratch each time. This problem is exacerbated because almost every operation is a lexically scoped function call. For example, you might think the following simple function calls two functions: `+` and `*`. In fact, it calls four because because `{` and `(` are regular functions in R.
```{r}
f <- function(x, y) {
(x + y) ^ 2
}
```
Since these functions are in the base environment, R has to look through every environment on the search path, which could easily be 10 or 20 environments. The following microbenchmark hints at the performance costs. We create four versions of `f()`, each with one more environment (containing 26 bindings) between the environment of `f()` and the base environment where `+`, `^`, `(`, and `{` are defined.
```{r}
random_env <- function(parent = globalenv()) {
letter_list <- setNames(as.list(runif(26)), LETTERS)
list2env(letter_list, envir = new.env(parent = parent))
}
set_env <- function(f, e) {
environment(f) <- e
f
}
f2 <- set_env(f, random_env())
f3 <- set_env(f, random_env(environment(f2)))
f4 <- set_env(f, random_env(environment(f3)))
microbenchmark(
f(1, 2),
f2(1, 2),
f3(1, 2),
f4(1, 2),
times = 1000
)
```
Each additional environment between `f()` and the base environment makes the function slower by about 50ns.
It might be possible to implement a caching system so that R only needs to look up the value of each name once. This is hard because there are so many ways to change the value associated with a name: `<<-`, `assign()`, `eval()`, ... Any caching system would have to connect to each of these tools to make sure the cache was correctly invalidated and you didn't get an old (incorrect) value.
Another simple fix would be to add more built in constants that you can't override. This, for example, would mean that you always know exactly what `+`, `-`, `{` and `(` mean, and you don't need to repeatedly look up their definitions. The costs of that decision are that writing the interpreter is a little harder (because there are more special cases), and the language isn't quite as flexible. This would change the R-language, but it would be unlikely to affect much existing code because it's such a bad idea to override functions like `{` and `(`.
### Lazy evaluation overhead
In R, functions arguments are evaluated lazily (as discussed in [lazy evaluation](#lazy-evaluation) and [capturing expressions](#capturing-expressions)). To implement lazy evaluation, R uses a promise object that contains the expression needed to compute the result, and the environment in which to perform the computation. Creating these objects has some overhead, so every additional argument to an R function slows it down a little.
The following microbenchmark compares the run time of a very simple function. Each version of the function has one extra argument. This suggests that each additional argument slows the function down by 20 ns.
```{r promise-cost}
f0 <- function() NULL
f1 <- function(a = 1) NULL
f2 <- function(a = 1, b = 1) NULL
f3 <- function(a = 1, b = 2, c = 3) NULL
f4 <- function(a = 1, b = 2, c = 4, d = 4) NULL
f5 <- function(a = 1, b = 2, c = 4, d = 4, e = 5) NULL
microbenchmark(f0(), f1(), f2(), f3(), f4(), f5(), times = 1000)
```
In most other programming languages there is little overhead for adding extra arguments. Many compiled languages will even warn you if arguments are never used (like in the above example), and automatically remove them from the function.
### Exercises
* How does the performance of S3 method dispatch change with the length
of the class vector? How does performance of S4 method dispatch change
with number of superclasses? How about RC?
* What is the cost of multiple inheritance and multiple dispatch on
S4 method dispatch?
* Why is the cost of lookup less for for functions in the base package?
* `scan()` has the most arguments (21) of any base function. About how
much time does it take to make 21 promises each time scan is called?
Given a simple input (e.g. `scan(text = "1 2 3", quiet = T)`) what
proportion of the total run time is due to creating those promises?
* Read [Evaluating the Design of the R Language](https://www.cs.purdue.edu/homes/jv/pubs/ecoop12.pdf). What other aspects of the R-language slow it
down? Construct microbenchmarks to illustrate.
## Implementation performance
The design of R-language limits its maximum theoretical performance. But GNU-R is currently nowhere close to that maximum, and there are many things that can (and will) be done to speed it up. This section discusses some of the parts of GNU-R that are currently slow not because of the language, but because no one has had the time to make them fast.
R is over 20 years old, and contains nearly 800,000 lines of code (about 45% C, 19% R, and 17% fortran). Changes to base R can only be made by members of the R Core Team (or R-core for short). R-core contains [20 members](http://www.r-project.org/contributors.html), but only six are actively involved in day-to-day development. No one on R-core is paid to work on R full time. Most are statistics professors, and can only spend a relatively small amount of their time working on R. Because of the care that must be taken to avoid breaking existing code, R-core tends to be very conservative about accepting new code. It can be frustrating to see obvious performance improvements rejected by R-core, but the driving motivation for R-core is not to make R faster, it's to build a stable platform for statistical computing.
Next, I'll show two small, but illustrative, parts of R that are currently slow but could be made faster with some effort. They are not critical parts of base R, but they have frustrated me in the past. As with all microbenchmarks, these won't affect the performance of most code, but can be important for special cases.
### Extracting a single value from a data frame
The following microbenchmark shows seven ways to access a single value (the number in the bottom-right corner) from the built-in `mtcars` dataset. The variation in performance is startling: the slowest method takes 30x longer than the fastest method. There's no reason for there to be such a huge difference in performance, but no one has spent the time to make the slowest methods faster.
```{r}
microbenchmark(
mtcars[32, 11],
mtcars[[c(11, 32)]],
mtcars[[11]][32],
mtcars$carb[32],
.subset2(mtcars, 11)[32]
)
```
### `ifelse()`, `pmin()`, and `pmax()`.
Some base functions are known to be slow. For example, look at the following three implementations of a function to `squish()` a vector to ensure that the smallest value is at least `a` and the largest value is at most `b`. The first implementation, `squish_ife()` uses `ifelse()`. `ifelse()` is known to be slow because it is relatively general, and must evaluate all arguments fully. The second implementation, `squish_p()`, uses `pmin()` and `pmax()` which should be much faster because they're so specialised. But they're actually rather slow because they can take any number of arguments and have to do some relatively complicated checks to determine which method to use. The final implementation uses basic subassignment.
```{r}
squish_ife <- function(x, a, b) {
ifelse(x <= a, a, ifelse(x >= b, b, x))
}
squish_p <- function(x, a, b) {
pmax(pmin(x, b), a)
}
squish_in_place <- function(x, a, b) {
x[x <= a] <- a
x[x >= b] <- b
x
}
x <- runif(100, -1.5, 1.5)
microbenchmark(
squish_ife(x, -1, 1),
squish_p(x, -1, 1),
squish_in_place(x, -1, 1)
)
```
There's quite a variation in speed: using `pmin()` and `pmax()` is about 3x faster than using `ifelse()`, and using subsetting directly is about twice as fast again. As we'll see in [Rcpp](#rcpp), we can often do even better by calling out to C++. The following example compares the best R implementation to a relatively simple (if verbose) implementation in C++. Even if you've never used C++, you should be able to follow the basic strategy: we loop over every element in the vector and perform a different action depending on whether the value is less than `a`, greater than `b`, or is ok. The C++ implementation is around 3x faster than the best pure R implementation.
```{r, engine = "Rcpp"}
#include
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector squish_cpp(NumericVector x, double a, double b) {
int n = x.length();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
double xi = x[i];
if (xi < a) {
out[i] = a;
} else if (xi > b) {
out[i] = b;
} else {
out[i] = xi;
}
}
return out;
}
```
```{r}
microbenchmark(
squish_in_place(x, -1, 1),
squish_cpp(x, -1, 1)
)
```
### Exercises
* The performance characteristics of `squish_ife()`, `squish_p()` and
`squish_in_place()` vary considerably with the size of `x`. Explore the
differences. For what sizes of input is the difference biggest? Smallest?
* Compare the performance costs of extract an element from a list, a
column from a matrix, and a column from a data frame. What about a row?
## Alternative R implementations {#faster-r}
There are some exciting new implementations of R. They all try to stick as closely as possible to the existing language definition, but implement ideas from modern interpreter design to make the implementation as fast as possible. The four most mature open-source projects are:
* [pqR](http://www.pqr-project.org/) (pretty quick R), by Radford Neal. It's
built on top of the existing R code base (2.15.0), and fixes many obvious
performance issues. It provides better memory management, and some
support for automatic multithreading.
* [Renjin](http://www.renjin.org/) by BeDataDriven. Renjin uses the
java virtual machine, and has an extensive
[test suite](http://packages.renjin.org/).
* [fastr](https://github.com/allr/fastr), by a team from Purdue. fastr
is similar to Renjin, but it makes more ambitious optimisations and
is somewhat less mature.
* [Riposte](https://github.com/jtalbot/riposte), by Justin Talbot and
Zachary DeVito. Riposte is experimental and ambitious, and
for the parts of R it implements is extremely fast. Riposte is
described in more detail in [Riposte: A Trace-Driven Compiler and Parallel
VM for Vector Code in R](http://www.justintalbot.com/wp-content/uploads/2012/10/pact080talbot.pdf).
These are roughly ordered in from most practical to most ambitious. These projects promise considerable speed ups over base R. However, bear in mind that the worst bottlenecks in R have often already been re-written in C or Fortran. This means that even if R did get 10x or 100x faster, the overall impact on your code is likely to be much smaller.
Another project, [CXXR](http://www.cs.kent.ac.uk/projects/cxxr/) by Andrew Runnalls, does not provide any performance improvements, but instead aims to refactor R's internal C code into a stronger foundation for future development. It aims to have identical behaviour to GNU-R, but has better documentation of internals and is more extensible.
R is a huge language and it's not clear whether any of these approaches will even become mainstream. It's hard task to make an alternative implementation run all R code in the same way to GNU-R. Can you imagine having to reimplement every function in base R to not only be faster, but to have exactly the same documented bugs? However, even if these implementations never make a dent in the use of GNU-R, they have other benefits:
* Simpler implementations make it easy to validate new approaches before
porting to GNU-R.
* Gain understanding about which aspects of language could be changed
with minimal impact to existing code and maximal impact on performance.
* Alternative implementations put pressure on the R-core to incorporate
performance improvements.
One of the most important approaches that pqR, renjin, fastR and riposte are exploring is the idea of deferred evaluation. As Justin Talbot, the author of riposte, points out "for long vectors, R's execution is completely memory bound. It spends almost all of its time reading and writing vector intermediates to memory". If you can eliminate intermediate vectors, you can not only decrease memory usage, you can also considerably improve performance.
The following example shows a very simple example of where deferred evaluation might help. We have three vectors, `x`, `y`, `z` each containing 1 million elements, and we want to find the sum of `x` + `y` where `z` is TRUE. (This represents a simplification of a pretty common sort of data analysis question.)
```{r, results = "hide"}
x <- runif(1e6)
y <- runif(1e6)
z <- sample(c(T, F), 1e6, rep = TRUE)
sum((x + y)[z])
```
In R, this creates two big temporary vectors: `x + y`, 1 million elements long, and `(x + y)[z]`, about 500,000 elements long. This means you need to have extra memory free for the intermediate calculation, and you have to shuttle the data back and forth between the CPU and memory. This slows computation down because the CPU can't work at maximum efficiency if it's always waiting for more data to come in.
If we rewrote the function using a loop in a language like C++, we could recognise that we only need one intermediate value: the sum of all the values we've seen:
```{r, engine = "Rcpp"}
#include
using namespace Rcpp;
// [[Rcpp::export]]
double cond_sum_cpp(NumericVector x, NumericVector y, LogicalVector z) {
double sum = 0;
int n = x.length();
for(int i = 0; i < n; i++) {
if (!z[i]) continue;
sum += x[i] + y[i];
}
return sum;
}
```
On my computer, this approach is about 8 times faster than the vectorised R equivalent (which is already pretty fast).
```{r}
cond_sum_r <- function(x, y, z) {
sum((x + y)[z])
}
microbenchmark(
cond_sum_cpp(x, y, z),
cond_sum_r(x, y, z),
unit = "ms"
)
```
The goal of deferred evaluation is to perform this transformation automatically, so you can write concise R code and have it automatically translated into efficient machine code. Sophisticated translators can also figure out how to make the most of multiple cores. In the above example, if you have four cores, you could split `x`, `y` and `z` into 4 pieces performing the conditional sum on each core, then adding together the four individual results. Deferred evaluation can also work with for loops, automatically discovering operations that can be vectorised.
This chapter has discussed some of the fundamental reasons that R is slow. The following chapters will give you the tools to do something about it when it impacts your code.