rde Tutorial

Stefan Kloppenborg

2018-07-01

When sharing R Notebooks with others, it’s not uncommon for the notebook to reference data that is only available on your machine. It could be that the recipient does not have access to a certain database, or it could be as simple as you forgetting to email them a CSV file with the data. In either of these cases, the analysis in the notebook is not self-contained. The package rde solves this problem by allowing you to embed the data directly in the notebook.

If you’re running on an X11 system (e.g. Linux), please read the section “Installing on X11 Systems” below before proceeding.

Let’s take an example. Let’s say that we have a spreadsheet of populations of the ten most populous countries (data originally taken from [1]). Somewhere near the top of our R Notebook, we have a code chunk that looks like the following:

library(knitr)  # provides kable()

fname <- "country_pop.csv"
pop.data <- read.csv(fname, stringsAsFactors = FALSE)
kable(pop.data)

Country          Population
China            1384688986
India            1296834042
United States     329256465
Indonesia         262787403
Brazil            208846892
Pakistan          207862518
Nigeria           195300343
Bangladesh        159453001
Russia            142122776
Japan             126168156

Now, if you send your notebook to someone else and don’t send along the file country_pop.csv, that person can look at your notebook, but they won’t be able to re-run it.

If you want to include the data directly in the notebook, you can use rde to do so.

rde provides two functions: load_rde_var and copy_rde_var. You’ll use load_rde_var in your notebook, and you’ll use copy_rde_var to create one of the arguments that load_rde_var needs.

The function load_rde_var takes three arguments. The first argument is a boolean (we’ll come back to this). The second argument is load.fcn. This is a piece of code that loads data from a source of your choosing (a CSV file, a database, etc.). This is the code that needs to work on your computer; it does not need to work on the computer of the notebook recipient. The third argument is cache. This argument is an encoded copy of the data.

When you call load_rde_var, the function will first try to load the data using the code in the load.fcn argument. If this fails, it will fall back on using the cache. In the latter case, it will give you a message to say that it used the cache instead of loading new data. This is what the recipient of your notebook would see if you neglected to send them the data file.

If load_rde_var succeeds in loading the data using the code in load.fcn, it will then compare this data with the data in cache. If there’s a difference, it will give you a warning. If you expected the data to change, you can go ahead and update the third argument (again using copy_rde_var); if you didn’t expect the data to change, well, now you know that it did change.

Now we’ll come back to that first argument of load_rde_var. This argument is a boolean called use.cache. This allows you to force load_rde_var to load data from the cache instead of running the code in load.fcn. Under most circumstances, this should be FALSE. However, sometimes, it may take a very long time to load your data from its original source (maybe the code executes a very long running database query, or scrapes a million webpages and just gives you a summary statistic). In the case that you don’t want to wait around while you load the data from its original source again, you can set that first argument to TRUE and just use the cached data.
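To make the three behaviours concrete (forced cache, fallback on failure, and the comparison warning), here is a minimal base R sketch of the same control flow. It is not the rde implementation: the function name load_with_fallback and its arguments are hypothetical, and the cache here is a plain R object rather than the encoded string that rde uses.

```r
# Hypothetical helper illustrating the control flow described above.
# It is NOT the rde implementation; `cached_value` is a plain R object
# here, whereas rde stores the cache as an encoded string.
load_with_fallback <- function(use_cache, load_expr, cached_value) {
  if (use_cache) {
    # Skip the (possibly slow) loading code entirely.
    message("Using cached data")
    return(cached_value)
  }

  # `load_expr` is evaluated lazily, so an error it raises is caught here.
  loaded <- tryCatch(load_expr, error = function(e) NULL)

  if (is.null(loaded)) {
    message("Loading failed; falling back to cached data")
    return(cached_value)
  }

  # Loading worked: warn if the fresh data no longer matches the cache.
  if (!isTRUE(all.equal(loaded, cached_value))) {
    warning("Cached data is different from loaded data")
  }
  loaded
}

cached <- data.frame(
  Country = c("China", "India"),
  Population = c(1384688986, 1296834042),
  stringsAsFactors = FALSE
)

# This file does not exist, so read.csv() fails and the cache is returned.
missing_file <- file.path(tempdir(), "no_such_file.csv")
result <- load_with_fallback(
  use_cache = FALSE,
  load_expr = read.csv(missing_file, stringsAsFactors = FALSE),
  cached_value = cached
)
```

Because R evaluates function arguments lazily, the loading expression only runs inside tryCatch, which is what lets the helper recover from a missing file.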

Continuing on with our example of loading the populations of the ten most populous countries, we would start by wrapping our existing code inside the second argument of load_rde_var. It would now look like this:

library(rde)

pop.data <- load_rde_var(
  use.cache = FALSE,
  load.fcn = {
    read.csv(fname, stringsAsFactors = FALSE)
  },
  cache = NULL  # We'll fill this in shortly
)
#> Cache is empty or not a string
#> Warning in doTryCatch(return(expr), name, parentenv, handler): Cached data
#> is different from loaded data

If we run that code as is, it will raise a warning. We would expect this since there is nothing in the cache argument, so of course, the result of the load.fcn and cache are different. We’ll need to fill in the cache argument of load_rde_var.

You’d start by loading your data into memory as you normally would (the code above would work fine). Once the data pop.data is in memory, you’re going to copy it into the cache argument of load_rde_var. You can use copy_rde_var to do so.

In the console, you would type:

copy_rde_var(pop.data)

When you execute this, your clipboard will contain an encoded text copy of the variable, from which load_rde_var can recreate it. The contents of your clipboard will look like this:

rde1QlpoOTFBWSZTWQy+/kYAAIB3/v//6EJABRg/WlQv797wYkAAAMQiABBAACAAAZGwANk0RTKejU9T
RoBoGgGjTRoBoGgaGymE0Kp+qemmkDNQ0YmJk0AA0xNADQNPUaA0JRhDTJoANAAAAAAAAEJx2Eja7QBK
MKPPkRAx63wSAWt31AABs1zauhwHifs5WlltyIyQKAAAZEAZGQYMIZEA6ZAPHVMEB71jSCqdlsiR/eSY
kzQkRq5RoXgvNNZnB5RSOvKaTGFtc/SXc74AhzqhMEJvdisEGVfo7UYngc0AwGqTvTHx8CBZTzE9OQZZ
VY8KAhHAhrG4RCeilM0rXKkdpjGqyNgJwAkmnPQOMYrLlQ4YTIv0WyxfYdkd9WSWUsvggC/i7kinChIB
l9/IwA==

You can go ahead and paste that into the cache argument of load_rde_var. Make sure that you paste it inside a pair of quotes. Line breaks and spaces within the cache argument don’t matter, so don’t worry about indenting to make your code pretty. The code at the top of your notebook will now look like the following.

library(rde)

pop.data <- load_rde_var(
  use.cache = FALSE,
  load.fcn = {
    fname <- system.file("extdata", "country_pop.csv", package = "rde")
    read.csv(fname, stringsAsFactors = FALSE)
  },
  cache = "
    rde1QlpoOTFBWSZTWQy+/kYAAIB3/v//6EJABRg/WlQv797wYkAAAMQiABBAACAAAZGwANk0RTKejU9T
    RoBoGgGjTRoBoGgaGymE0Kp+qemmkDNQ0YmJk0AA0xNADQNPUaA0JRhDTJoANAAAAAAAAEJx2Eja7QBK
    MKPPkRAx63wSAWt31AABs1zauhwHifs5WlltyIyQKAAAZEAZGQYMIZEA6ZAPHVMEB71jSCqdlsiR/eSY
    kzQkRq5RoXgvNNZnB5RSOvKaTGFtc/SXc74AhzqhMEJvdisEGVfo7UYngc0AwGqTvTHx8CBZTzE9OQZZ
    VY8KAhHAhrG4RCeilM0rXKkdpjGqyNgJwAkmnPQOMYrLlQ4YTIv0WyxfYdkd9WSWUsvggC/i7kinChIB
    l9/IwA==
  "
)

Now, when we run this, it won’t raise a warning because load.fcn and cache are the same.
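As an aside, the general idea of storing a data frame as a block of text can be illustrated with base R alone. The sketch below is only an illustration under stated assumptions: the helper names encode_object and decode_object are hypothetical, and the hex encoding used here is not the format that copy_rde_var produces. It does show why whitespace inside such a string can safely be ignored.

```r
# Hypothetical helpers; this is NOT the encoding that rde uses.
encode_object <- function(x) {
  # Serialize to raw bytes, compress, then write each byte as two hex digits.
  raw_bytes <- memCompress(serialize(x, connection = NULL), type = "bzip2")
  paste(as.character(raw_bytes), collapse = "")
}

decode_object <- function(txt) {
  # Whitespace carries no information, so drop it before decoding.
  txt <- gsub("[[:space:]]", "", txt)
  starts <- seq(1, nchar(txt), by = 2)
  hex_pairs <- substring(txt, starts, starts + 1)
  raw_bytes <- as.raw(strtoi(hex_pairs, base = 16L))
  unserialize(memDecompress(raw_bytes, type = "bzip2"))
}

pop_sample <- data.frame(
  Country = c("China", "India"),
  Population = c(1384688986, 1296834042),
  stringsAsFactors = FALSE
)

encoded <- encode_object(pop_sample)

# Simulate the indentation and line breaks introduced when pasting.
pasted <- paste0("\n    ", substr(encoded, 1, 20), "\n    ",
                 substr(encoded, 21, nchar(encoded)), "\n  ")
restored <- decode_object(pasted)
```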

If you send this notebook to someone else, but neglect to send the data file, they can still play around with the data because it’s embedded directly in the code. They will, however, get a message indicating that the data has been loaded from cache.

What if you inadvertently change the data file? Or if you’re reading the data from a database that changes? Well, if that happens, load.fcn and cache won’t match. In this case, you’ll get a warning. This can be useful: maybe you didn’t expect the data to change, or maybe you need to update some of the text in your notebook because some of your conclusions or explanations need to change. Assuming that the change in the data file (or database) isn’t some sort of mistake, make sure that you update the value of the cache argument with the new data (again, you’ll use the copy_rde_var function to do so).
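When the warning appears, it can help to see exactly what changed before deciding whether to update the cache. The following base R sketch shows one way to compare two versions of a data frame. The objects cached and fresh are hypothetical stand-ins for the cached and freshly loaded data, and the row comparison assumes both data frames have the same shape.

```r
# Hypothetical stand-ins for the cached data and the freshly loaded data.
cached <- data.frame(
  Country = c("China", "India"),
  Population = c(1384688986, 1296834042),
  stringsAsFactors = FALSE
)
fresh <- cached
fresh$Population[2] <- 1300000000  # pretend the source file was edited

# all.equal() returns TRUE when the objects match (within a small numeric
# tolerance), or a character vector describing the differences otherwise.
differences <- all.equal(cached, fresh)
print(differences)

# Rows where at least one value differs (same-shaped data frames only).
changed_rows <- which(rowSums(cached != fresh) > 0)
print(fresh[changed_rows, ])
```

If even tiny numeric differences matter, identical(cached, fresh) performs an exact comparison instead.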

Installing on X11 Systems

If you’re on an X11 system (like Linux), you’ll need to install some additional software: either xsel or xclip. You should not have to do this on Windows or Mac. Depending on the distribution that you use, you will probably install it using a command like sudo apt-get install xsel (this form applies to Debian-based distributions such as Ubuntu).
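To check from within R whether either tool is already available, you can look for it on the search path. This is a small sketch using base R’s Sys.which. It only reports whether the executables can be found and makes no assumption about how rde itself detects them.

```r
# Look up the two clipboard utilities on the PATH.
# Sys.which() returns "" for any program that cannot be found.
clipboard_tools <- Sys.which(c("xsel", "xclip"))
print(clipboard_tools)

if (any(nzchar(clipboard_tools))) {
  message("Found a clipboard utility: ",
          paste(names(clipboard_tools)[nzchar(clipboard_tools)], collapse = ", "))
} else {
  message("Neither xsel nor xclip was found; install one of them first.")
}
```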

References

[1] U.S. Census Bureau, “Current Population.” [Online]. Available: https://www.census.gov/popclock/print.php?component=counter. [Accessed: 13-Mar-2018]