[R] How to benchmark speed of load/readRDS correctly

Ismail SEZEN sezenismail at gmail.com
Wed Aug 23 15:31:54 CEST 2017


First of all, I want to mention the _warmup_ parameter of the _control_ argument of the microbenchmark function. Its default value is 2, so the function runs the code twice before it starts recording timings.
See ?microbenchmark
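
For example, a minimal sketch (the file f is a throwaway test file created just for the illustration):

# raise the warmup to 10 iterations via the control argument
library(microbenchmark)

f <- tempfile(fileext = ".rds")
saveRDS(rnorm(1e6), f)

microbenchmark(readRDS(f), times = 100L, control = list(warmup = 10))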

> However, the first iteration always takes the longest. I'm wondering if I should take the first value as best guess.

So, at least for the microbenchmark function, the maximum iteration time in the result does not come from the first iteration; it is more likely caused by other processes/factors that hit the file system at the same time as your code.
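
You can check this on your own machine by running the iterations in their given order and looking at where the slowest one falls (a sketch; it reuses the file f from the example above):

# run iterations in order instead of the default random order,
# then locate the slowest run
mb <- microbenchmark(readRDS(f), times = 100L,
                     control = list(order = "inorder", warmup = 2))
which.max(mb$time)   # usually not 1, i.e. the maximum is not the first run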

We can also examine the underlying code of the load and readRDS functions. Simply type _load_ and _readRDS_ (without parentheses) at the R prompt to see their source.

# readRDS
function (file, refhook = NULL) 
{
    if (is.character(file)) {
        con <- gzfile(file, "rb")
        on.exit(close(con))
    }
    else if (inherits(file, "connection")) 
        con <- if (inherits(file, "gzfile") || inherits(file, 
            "gzcon")) 
            file
        else gzcon(file)
    else stop("bad 'file' argument")
    .Internal(unserializeFromConn(con, refhook))
}

link to unserializeFromConn -> https://github.com/wch/r-source/blob/94cd276ed0eef865e01fcf4e96925d9373cc5799/src/main/serialize.c#L2246

# load
function (file, envir = parent.frame(), verbose = FALSE) 
{
    if (is.character(file)) {
        con <- gzfile(file)
        on.exit(close(con))
        magic <- readChar(con, 5L, useBytes = TRUE)
        if (!length(magic)) 
            stop("empty (zero-byte) input file")
        if (!grepl("RD[AX]2\\n", magic)) {
            if (grepl("RD[ABX][12]\\r", magic)) 
                stop("input has been corrupted, with LF replaced by CR")
            warning(sprintf("file %s has magic number '%s'\\n", 
                sQuote(basename(file)), gsub("[\\n\\r]*", "", magic)), 
                "  ", "Use of save versions prior to 2 is deprecated", 
                domain = NA, call. = FALSE)
            return(.Internal(load(file, envir)))
        }
    }
    else if (inherits(file, "connection")) {
        con <- if (inherits(file, "gzfile") || inherits(file, 
            "gzcon")) 
            file
        else gzcon(file)
    }
    else stop("bad 'file' argument")
    if (verbose) 
        cat("Loading objects:\\n")
    .Internal(loadFromConn2(con, envir, verbose))
}

link to loadFromConn2 -> https://github.com/wch/r-source/blob/c1093fa1073fef6404869f26a1be6ef5bd2aa0fd/src/main/saveload.c#L2329

Both end in a call to an internal C function, "unserializeFromConn" and "loadFromConn2" respectively; you can examine them at the links given above.

Even without knowing C/C++, we can see that both internal functions use similar code to read the data. Also, the _load_ function has more lines of R code than _readRDS_, because it checks a few extra bytes (the magic number) before reading. This may account for the small difference you found in your tests; see the mean and median.
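
You can look at those extra bytes yourself (a sketch; the object and file name are made up, and on R versions current at the time of writing the default save version 2 gives the magic number "RDX2\n"):

# peek at the 5-byte magic number, exactly as load() does
x <- 1:10
save(x, file = "ab.RData")
con <- gzfile("ab.RData", "rb")
readChar(con, 5L, useBytes = TRUE)   # "RDX2\n" for a version-2 save
close(con)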

Additionally, I want to discuss another aspect: why are there two functions, _readRDS_ and _load_, at all?

Because they have different purposes. You use the _load_ function to restore variables saved in bulk by the _save_ function, and you use the _readRDS_ function to read a single object saved by the _saveRDS_ function. So it is inevitable that the two functions are optimized for different things. It is like comparing apples and pears: both are fruit, but sometimes you want an apple and sometimes a pear.
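
A minimal sketch of the practical difference (object and file names are just for illustration):

# saveRDS stores one object without its name; you choose the name on read
x <- 1:10
saveRDS(x, "x.rds")
y <- readRDS("x.rds")           # restored under whatever name you like

# save stores several objects together with their names;
# load restores them under those original names
a <- 1; b <- 2
save(a, b, file = "ab.RData")
rm(a, b)
load("ab.RData")                # 'a' and 'b' reappear in the environment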

As a result, if I need to save and read a single R object, I prefer the saveRDS/readRDS pair. If I need to serialize multiple objects to one file, I use the save/load pair.

> On 23 Aug 2017, at 15:40, raphael.felber at agroscope.admin.ch wrote:
> 
> Hi there
> 
> Thanks for your answers. I didn't expect that this would be so complex. Honestly, I don't understand everything you wrote, since I'm not an IT specialist. But I read that reading *.rds files is faster than loading *.RData files, and I wanted to prove that for my system and R version. But thanks anyway for your time.
> 
> Cheers Raphael
> 
> 
>> -----Original Message-----
>> From: Jeff Newmiller [mailto:jdnewmil at dcn.davis.ca.us]
>> Sent: Tuesday, 22 August 2017 18:33
>> To: J C Nash <profjcnash at gmail.com>; r-help at r-project.org; Felber Raphael
>> Agroscope <raphael.felber at agroscope.admin.ch>
>> Subject: Re: [R] How to benchmark speed of load/readRDS correctly
>> 
>> Caching happens, both within the operating system and within the C
>> standard library. Ostensibly the intent for those caches is to help
>> performance, but you are right that different low-level caching algorithms
>> can be a poor match for specific application level use cases such as copying
>> files or parsing text syntax. However, the OS and even the specific file
>> system drivers (e.g. ext4 on flash disk or FAT32 on magnetic media) can
>> behave quite differently for the same application level use case, so a generic
>> discussion at the R language level (this mailing list) can be almost impossible
>> to sort out intelligently.
>> --
>> Sent from my phone. Please excuse my brevity.
>> 
>> On August 22, 2017 7:11:39 AM PDT, J C Nash <profjcnash at gmail.com>
>> wrote:
>>> Not convinced Jeff is completely right about this not concerning R,
>>> since I've found that the application language (R, perl, etc.) makes a
>>> difference in how files are accessed by/to OS. He is certainly correct
>>> that OS (and versions) are where the actual reading and writing
>>> happens, but sometimes the call to those can be inefficient. (Sorry,
>>> I've not got examples specifically for file reads, but had a case in
>>> computation where there was an 800% i.e., 80000 fold difference in
>>> timing with R, which rather took my breath away. That's probably been
>>> sorted now.) The difficulty in making general statements is that a
>>> rather full set of comparisons over different commands, datasets, OS
>>> and version variants is needed before the general picture can emerge.
>>> Using microbenchmark when you need to find the bottlenecks is how I'd
>>> proceed, which OP is doing.
>>> 
>>> About 30 years ago, I did write up some preliminary work, never
>>> published, on estimating the two halves of a copy, that is, the reading
>>> from file and storing to "memory" or a different storage location. This
>>> was via regression with a singular design matrix, but one can get a
>>> minimal length least squares solution via svd. Possibly relevant today
>>> to try to get at slow links on a network.
>>> 
>>> JN
>>> 
>>> On 2017-08-22 09:07 AM, Jeff Newmiller wrote:
>>>> You need to study how reading files works in your operating system.
>>> This question is not about R.
>>>> 


