Tracking Peak RAM Use and Minimizing Overhead

Thomas Quinn

2017-01-16

Quick start

When working with large datasets, conserving RAM is critically important. However, it is not always enough to monitor the size of the objects you create. R's so-called “copy-on-modify” behavior means that some expressions or functions can require an unexpectedly large amount of RAM overhead. For example, replacing a single value in a matrix (e.g., with ‘[<-’) can duplicate that matrix behind the scenes, so that the task temporarily requires twice as much RAM as the matrix itself occupies. The peakRAM package makes it easy to monitor the total and peak RAM used so that developers can quickly identify and eliminate RAM-hungry code. You can get started with peakRAM by installing the package directly from CRAN.

install.packages("peakRAM")
library(peakRAM)
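As a minimal base-R sketch of the copy-on-modify behavior mentioned above, consider two names bound to the same matrix. The matrix size and the names m, m2, and size_mib are illustrative assumptions, not part of the package.

```r
# Base R only. Names and sizes here are hypothetical, chosen for illustration.
m <- matrix(0, nrow = 1000, ncol = 1000)  # 1e6 doubles

m2 <- m         # no data is copied yet: both names refer to the same matrix
m2[1, 1] <- 1   # modifying m2 makes R duplicate the matrix first

m[1, 1]         # still 0: the original matrix was left untouched
m2[1, 1]        # 1: only the copy changed

# Size of one matrix in MiB (about 7.6); after the assignment, two of them exist
size_mib <- as.numeric(object.size(m)) / 1024^2
size_mib
```

Whether a given replacement actually triggers a duplication depends on how many references to the object exist and on the R version, which is why measuring is more reliable than guessing.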

Monitoring RAM overhead

The peakRAM package, inspired by the very elegant microbenchmark package, offers an easy way to monitor the total and peak RAM used by any number of R expressions or functions, including anonymous functions. Simply call peakRAM with any number of comma-separated expressions or functions as arguments. The function evaluates each argument in turn, recording the amount of RAM allocated as a result of that call (i.e., “Total RAM Used”) as well as the maximum amount of RAM allocated at any point during that call (i.e., “Peak RAM Used”). Note that throughout this package, all RAM use is measured in mebibytes (MiB).

peakRAM(function() 1:1e7,
        1:1e7,
        1:1e7 + 1:1e7,
        1:1e7 * 2)
##       Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
## 1 function()1:1e+07            0.040               38.1              38.1
## 2           1:1e+07            0.039               38.1              38.1
## 3   1:1e+07+1:1e+07            0.100               38.1              76.3
## 4         1:1e+07*2            0.058               76.3             114.4
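To give a feel for where figures like these can come from, here is a rough sketch of one way to measure total and peak RAM in base R using gc(). It is not necessarily how peakRAM is implemented. The helper name measure_ram is hypothetical, and the sketch assumes that the second column of the gc() matrix holds the memory in use and the last column holds the maximum used since the last reset, both in units of 1024^2 bytes.

```r
# Hypothetical helper, for illustration only (not part of peakRAM).
measure_ram <- function(f) {
  before <- gc(reset = TRUE)   # collect garbage and reset the "max used" counters
  value <- f()                 # keep the result alive so it counts toward the total
  after <- gc()                # collect again; "max used" now reflects the call

  used_before <- sum(before[, 2])                     # in use before the call
  total <- sum(after[, 2]) - used_before              # still allocated afterwards
  peak <- sum(after[, ncol(after)]) - used_before     # high-water mark during the call
  c(total_mib = total, peak_mib = peak)
}

res <- measure_ram(function() 1:1e7 * 2)
res
```

The exact numbers vary with the R version and session state, so treat them as approximate. The peak figure should never fall below the total figure.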

Discussion

What happened here? Well, we see that initializing the vector 1:1e7 requires ~38 MiB of RAM, whether done through an anonymous function or not. Also, as we might expect, we see that adding 1:1e7 to 1:1e7 requires ~76 MiB of RAM at its peak, even though the result occupies only ~38 MiB of RAM, because the vector 1:1e7 is initialized twice.

When RAM is valuable and the object is large, we want to avoid this kind of overhead. To achieve this, we might try instead to double the vector 1:1e7, avoiding addition altogether. But wait, this uses even more RAM. Why? Well, multiplying by 2 in this case first copies the integer vector to a double vector, then multiplies the double vector by 2. For an instant, the original integer vector and the new double vector exist simultaneously, occupying ~38 MiB plus ~76 MiB of RAM, or ~114 MiB in total.
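The figures in the table can be checked with a little arithmetic, given that R stores integers in 4 bytes and doubles in 8 bytes. The variable names below are illustrative, and the small per-object header is ignored.

```r
n <- 1e7                      # length of the vector 1:1e7

int_mib <- n * 4 / 1024^2     # integer vector: 4 bytes per element
dbl_mib <- n * 8 / 1024^2     # double vector: 8 bytes per element

round(c(integer = int_mib,
        double  = dbl_mib,
        both    = int_mib + dbl_mib), 1)
# integer  double    both
#    38.1    76.3   114.4
```

The "both" figure matches the peak reported for 1:1e7 * 2, which is consistent with the integer and double vectors coexisting for a moment.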

Alas, R is a most precarious lover. To conserve the maximum amount of RAM, we need an approach that makes no needless copies. In this case, we just need to force 2 to exist as an integer, which the expression 2:2 does.

peakRAM(1:1e7 * 2:2)
##   Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
## 1   1:1e+07*2:2            0.045               38.2              38.2
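The difference comes down to the storage type of the multiplier, which can be inspected with typeof(). This small sketch uses a short vector x as a stand-in, and it also uses the integer literal 2L, which is an alternative spelling not used in the example above.

```r
x <- 1:10                 # small stand-in for 1:1e7

types <- c(
  plain_two   = typeof(2),       # "double"
  literal_two = typeof(2L),      # "integer"
  colon_two   = typeof(2:2),     # "integer"
  x_times_2   = typeof(x * 2),   # "double": the result is promoted
  x_times_2L  = typeof(x * 2L)   # "integer": the result stays integer
)
types

# Integer results must fit below this limit; overflow yields NA with a warning.
.Machine$integer.max      # 2147483647
```

Staying in integer arithmetic is safe here because the largest result, 2e7, is well below the integer limit. For larger values, the conversion to double may be unavoidable.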

Now, we have a solution that we can scale confidently, knowing for sure that we will not unwittingly exceed memory capacity through superfluous RAM overhead.