[R] How to benchmark speed of load/readRDS correctly

Tue Aug 22 15:07:13 CEST 2017

You need to study how reading files works in your operating system. This question is not about R.
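
As a rough illustration of what that means for timing: the first read of a file usually comes from disk, while repeated reads are typically served from the operating system's file cache, so the two should be reported separately. The sketch below is only an assumption-laden example, not a definitive benchmark. The data frame `d`, the temporary file paths, the helper names `time_reads`, `read_rdata` and `summarise_times`, and the repetition count are all hypothetical stand-ins for your own setup.

```r
# Hypothetical data; stands in for the object `d` from the question.
set.seed(1)
d <- data.frame(x = rnorm(1e5), y = runif(1e5))

# Temporary files so the sketch does not touch any real data.
rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")
saveRDS(d, rds_file, compress = FALSE)
save(d, file = rdata_file, compress = FALSE)

# Read `path` `reps` times and return the elapsed seconds of each read.
time_reads <- function(read_fun, path, reps = 10) {
  vapply(seq_len(reps), function(i) {
    system.time(read_fun(path))[["elapsed"]]
  }, numeric(1))
}

# load() restores objects into an environment; use a throwaway one
# so the benchmark does not overwrite `d` in the workspace.
read_rdata <- function(path) load(path, envir = new.env())

rds_times   <- time_reads(readRDS, rds_file)
rdata_times <- time_reads(read_rdata, rdata_file)

# Report the first read separately from the median of the later reads.
summarise_times <- function(secs) {
  c(first = secs[1], median_rest = median(secs[-1]))
}
print(rbind(readRDS = summarise_times(rds_times),
            load    = summarise_times(rdata_times)))
```

Because the files in this sketch were written just before being read, even the first read is probably already cached. Measuring a genuinely cold read needs the cache emptied by operating-system-specific means, which is outside R.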
-- 
Sent from my phone. Please excuse my brevity.

On August 22, 2017 5:53:09 AM PDT, raphael.felber at agroscope.admin.ch wrote:
>Dear all
>
>I was thinking about reading data into R efficiently and tried several
>ways to test whether load('file.Rdata') or readRDS('file.rds') is faster.
>The files file.Rdata and file.rds contain the same data, the first
>created with save(d, file='file.Rdata', compress=FALSE) and the second
>with saveRDS(d, 'file.rds', compress=FALSE).
>
>First I used the function microbenchmark() and was astonished by
>the max value of the output.
>
>FIRST TEST:
>> library(microbenchmark)
>> microbenchmark(
>+   n <- readRDS('file.rds'),
>+   load('file.Rdata')
>+ )
>Unit: milliseconds
>              expr      min       lq     mean   median       uq       max neval
> n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
>         load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100
>
>It looks like the max value is an outlier.
>
>So I tried:
>SECOND TEST:
>> sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
>elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
>  10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
>> sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
>elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
>   1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30
>
>This confirmed my suspicion: loading the data for the first time takes
>much longer than the subsequent loads. I suspect that this has something
>to do with how the data is assigned, and that R doesn't have to 'fully'
>read the data when it is read a second time.
>
>So the question remains: how can I make a realistic benchmark test?
>From the first test I would conclude that reading the *.rds file is
>faster. But this holds only for a large number of evaluations (neval).
>If I set times = 1, then reading the *.Rdata file would be faster (as
>also indicated by the second test).
>
>Thanks for any help or comments.
>
>Kind regards
>
>Raphael
>------------------------------------------------------------------------------------
>Raphael Felber, PhD
>Scientific Officer, Climate & Air Pollution
>
>Federal Department of Economic Affairs,
>Education and Research EAER
>Agroscope
>Research Division, Agroecology and Environment
>
>Reckenholzstrasse 191, CH-8046 Zürich
>Phone +41 58 468 75 11
>Fax     +41 58 468 72 01
>raphael.felber at agroscope.admin.ch
>www.agroscope.ch