[R] How to benchmark speed of load/readRDS correctly
raphael.felber at agroscope.admin.ch
Wed Aug 23 14:31:13 CEST 2017
Hi Bill
Thanks for your answer and the explanations. I tried to use garbage collection, but I'm still not satisfied with the result. Maybe the question was not stated clearly enough: I want to test the speed of reading/loading data into R when a 'fresh' R session is started (or even after a restart of the computer).
To understand what really happens, I tried:
r1 <- sapply(1:10000, function(x) { gc(); t <- system.time(n <- readRDS('file.rds'))[3]; rm(n); gc(); return(t) })
and found behaviour similar to yours: every now and then the time is much larger, but the times are not as stable as in your example. The highest values are up to 50 times larger than most of the others (8 s vs 0.15 s), even with garbage collection. I assume that with the code above the time spent on garbage collection itself isn't measured, since the gc() calls sit outside the system.time() expression.
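To include that cost in each sample, a minimal variant of the code above (just a sketch, reusing the same 'file.rds') moves the cleanup inside the timed expression:

# Time the read together with the forced collection, so the cost of
# cleaning up the previous iteration's garbage is part of each sample.
r2 <- sapply(1:100, function(x) {
    system.time({ n <- readRDS('file.rds'); rm(n); gc() })[3]
})
summary(r2)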
However, the first iteration always takes the longest, so I'm wondering whether I should take that first value as the best estimate for a fresh session.
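A sketch of how the fresh-session case might be measured directly, assuming Rscript is on the PATH (the cold_read() helper below is hypothetical): each timing runs in a brand-new R process, so nothing persists inside R between repetitions. The operating system's file cache can still warm up across calls, so only the very first read after a reboot (or an explicit cache flush) is truly cold.

# Run one read per fresh R process and capture the elapsed time from
# stdout; the expression is quoted for the shell with shQuote().
cold_read <- function(path = 'file.rds') {
    expr <- sprintf('cat(system.time(readRDS("%s"))[3])', path)
    as.numeric(system2('Rscript', c('-e', shQuote(expr)), stdout = TRUE))
}
times <- replicate(5, cold_read())
times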
Cheers Raphael
From: William Dunlap [mailto:wdunlap at tibco.com]
Sent: Tuesday, 22 August 2017 19:13
To: Felber Raphael Agroscope <raphael.felber at agroscope.admin.ch>
Cc: r-help at r-project.org
Subject: Re: [R] How to benchmark speed of load/readRDS correctly
Note that if you force a garbage collection on each iteration, the times are more stable. However, on average it is faster to let the garbage collector decide when to leap into action.
mb_gc <- microbenchmark::microbenchmark(
    gc(),
    { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
    times = 1000,
    control = list(order = "inorder"))
with(mb_gc, plot(time[expr!="gc()"]))
with(mb_gc, quantile(1e-6*time[expr!="gc()"], c(0, .5, .75, .9, .95, .99, 1)))
#       0%      50%      75%      90%      95%      99%      100%
# 59.33450 61.33954 63.43457 66.23331 68.93746 74.45629 158.09799
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <wdunlap at tibco.com> wrote:
The large value for the maximum time may be due to garbage collection, which happens periodically. E.g., try the following, where the unlist(as.list()) creates a lot of garbage. I get a very large time every 102 or 51 iterations and a moderately large time more often than that.
mb <- microbenchmark::microbenchmark(
    { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
    times = 1000)
plot(mb$time)
quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
#       0%      50%       75%       90%       95%       99%      100%
# 59.04446 82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
diff(which(mb$time > quantile(mb$time, .99)))
# [1] 102 51 102 102 102 102 102 102 51
diff(which(mb$time > quantile(mb$time, .95)))
# [1] 6 41 4 47 4 40 7 4 47 4 33 14 4 47 4 47 4 47 4 47 4 47 4 6 41
#[26] 4 6 7 9 25 4 47 4 47 4 47 4 22 25 4 33 14 4 6 41 4 47 4 22
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Tue, Aug 22, 2017 at 5:53 AM, <raphael.felber at agroscope.admin.ch> wrote:
Dear all
I was thinking about reading data into R efficiently and tried several ways to test whether load('file.Rdata') or readRDS('file.rds') is faster. The files file.Rdata and file.rds contain the same data; the first was created with save(d, file='file.Rdata', compress=FALSE) and the second with saveRDS(d, 'file.rds', compress=FALSE).
First I used the function microbenchmark() and was astonished by the max value of the output.
FIRST TEST:
> library(microbenchmark)
> microbenchmark(
+ n <- readRDS('file.rds'),
+ load('file.Rdata')
+ )
Unit: milliseconds
                     expr      min       lq     mean   median       uq       max neval
 n <- readRDS('file.rds') 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
       load('file.Rdata') 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100
It looks like the max value is an outlier.
So I tried:
SECOND TEST:
> sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
10.50 0.11 0.11 0.11 0.10 0.11 0.11 0.11 0.12 0.12
> sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
1.86 0.29 0.31 0.30 0.30 0.31 0.30 0.29 0.31 0.30
Which confirmed my suspicion: the first read of the data takes much longer than the following ones. I suspect this has something to do with how the data is assigned and that R doesn't have to 'fully' read the data when it is read a second time.
So the question remains: how can I make a realistic benchmark? From the first test I would conclude that reading the *.rds file is faster, but this holds only for a large number of evaluations (neval). If I set times = 1, then reading the *.Rdata file would be faster (as the second test also indicates).
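For the warm-cache comparison itself, one option (a sketch, reusing the file names above) is to summarize repeated timings by their median and quantiles, which a single outlier such as the first read cannot dominate the way it dominates the mean and max:

library(microbenchmark)

# 'time' is in nanoseconds; convert to milliseconds and look at
# quantiles per expression instead of the mean/max columns.
mb <- microbenchmark(
    rds   = readRDS('file.rds'),
    rdata = load('file.Rdata'),
    times = 100)
tapply(mb$time * 1e-6, mb$expr, quantile, probs = c(0, .5, .9, 1))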
Thanks for any help or comments.
Kind regards
Raphael
------------------------------------------------------------------------------------
Raphael Felber, PhD
Scientific Officer, Climate & Air Pollution
Federal Department of Economic Affairs,
Education and Research EAER
Agroscope
Research Division, Agroecology and Environment
Reckenholzstrasse 191, CH-8046 Zürich
Phone +41 58 468 75 11
Fax +41 58 468 72 01
raphael.felber at agroscope.admin.ch
www.agroscope.ch