[R] How to benchmark speed of load/readRDS correctly
raphael.felber at agroscope.admin.ch
Wed Aug 23 14:31:13 CEST 2017
Hi Bill
Thanks for your answer and the explanations. I tried to use garbage collection, but I'm still not satisfied with the result. Maybe the question was not stated clearly enough: I want to test the speed of reading/loading data into R when a 'fresh' R session is started (or even after a restart of the computer).
To understand what really happens, I tried:
r1 <- sapply(1:10000, function(x) { gc(); t <- system.time(n <- readRDS('file.rds'))[3]; rm(n); gc(); return(t) })
and found behaviour similar to yours: every now and then the time is much larger, but the times are not as stable as in your example. The highest values are up to 50 times larger than most of the others (8 s vs 0.15 s), even with garbage collection. I assume that with the code above the time spent on garbage collection itself isn't measured, since the gc() calls sit outside the system.time() expression.
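To include that cost in each sample, a minimal variant of the code above (just a sketch, reusing the same 'file.rds') moves the cleanup inside the timed expression:

# Time the read together with the forced collection, so the cost of
# cleaning up the previous iteration's garbage is part of each sample.
r2 <- sapply(1:100, function(x) {
    system.time({ n <- readRDS('file.rds'); rm(n); gc() })[3]
})
summary(r2)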
However, the first iteration always takes the longest, so I'm wondering whether I should take that first value as the best estimate for a fresh session.
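A sketch of how the fresh-session case might be measured directly, assuming Rscript is on the PATH (the cold_read() helper below is hypothetical): each timing runs in a brand-new R process, so nothing persists inside R between repetitions. The operating system's file cache can still warm up across calls, so only the very first read after a reboot (or an explicit cache flush) is truly cold.

# Run one read per fresh R process and capture the elapsed time from
# stdout; the expression is quoted for the shell with shQuote().
cold_read <- function(path = 'file.rds') {
    expr <- sprintf('cat(system.time(readRDS("%s"))[3])', path)
    as.numeric(system2('Rscript', c('-e', shQuote(expr)), stdout = TRUE))
}
times <- replicate(5, cold_read())
times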
Cheers Raphael
From: William Dunlap [mailto:wdunlap at tibco.com]
Sent: Tuesday, 22 August 2017 19:13
To: Felber Raphael Agroscope <raphael.felber at agroscope.admin.ch>
Cc: r-help at r-project.org
Subject: Re: [R] How to benchmark speed of load/readRDS correctly
Note that if you force a garbage collection on each iteration, the times are more stable. However, on average it is faster to let the garbage collector decide when to leap into action.
mb_gc <- microbenchmark::microbenchmark(
    gc(),
    { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
    times = 1000,
    control = list(order = "inorder"))
with(mb_gc, plot(time[expr!="gc()"]))
with(mb_gc, quantile(1e-6*time[expr!="gc()"], c(0, .5, .75, .9, .95, .99, 1)))
#       0%      50%      75%      90%      95%      99%      100%
# 59.33450 61.33954 63.43457 66.23331 68.93746 74.45629 158.09799
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Tue, Aug 22, 2017 at 9:26 AM, William Dunlap <wdunlap at tibco.com> wrote:
The large value for the maximum time may be due to garbage collection, which happens periodically. E.g., try the following, where the unlist(as.list()) creates a lot of garbage. I get a very large time every 102 or 51 iterations and a moderately large time more often than that.
mb <- microbenchmark::microbenchmark(
    { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
    times = 1000)
plot(mb$time)
quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
#       0%      50%       75%       90%       95%       99%      100%
# 59.04446 82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
diff(which(mb$time > quantile(mb$time, .99)))
# [1] 102 51 102 102 102 102 102 102 51
diff(which(mb$time > quantile(mb$time, .95)))
# [1] 6 41 4 47 4 40 7 4 47 4 33 14 4 47 4 47 4 47 4 47 4 47 4 6 41
#[26] 4 6 7 9 25 4 47 4 47 4 47 4 22 25 4 33 14 4 6 41 4 47 4 22
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Tue, Aug 22, 2017 at 5:53 AM, <raphael.felber at agroscope.admin.ch> wrote:
Dear all
I was thinking about reading data into R efficiently and tried several ways to test whether load('file.Rdata') or readRDS('file.rds') is faster. The files file.Rdata and file.rds contain the same data; the first was created with save(d, file='file.Rdata', compress=FALSE) and the second with saveRDS(d, 'file.rds', compress=FALSE).
First I used the function microbenchmark() and was astonished by the max value of the output.
FIRST TEST:
> library(microbenchmark)
> microbenchmark(
+ n <- readRDS('file.rds'),
+ load('file.Rdata')
+ )
Unit: milliseconds
                     expr      min       lq     mean   median       uq       max neval
 n <- readRDS('file.rds') 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
       load('file.Rdata') 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100
It looks like the max value is an outlier.
So I tried:
SECOND TEST:
> sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
10.50 0.11 0.11 0.11 0.10 0.11 0.11 0.11 0.12 0.12
> sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
1.86 0.29 0.31 0.30 0.30 0.31 0.30 0.29 0.31 0.30
Which confirmed my suspicion: the first read of the data takes much longer than the following ones. I suspect this has something to do with how the data is assigned and that R doesn't have to 'fully' read the data when it is read a second time.
So the question remains: how can I make a realistic benchmark? From the first test I would conclude that reading the *.rds file is faster, but this holds only for a large number of evaluations (neval). If I set times = 1, then reading the *.Rdata file would be faster (as the second test also indicates).
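For the warm-cache comparison itself, one option (a sketch, reusing the file names above) is to summarize repeated timings by their median and quantiles, which a single outlier such as the first read cannot dominate the way it dominates the mean and max:

library(microbenchmark)

# 'time' is in nanoseconds; convert to milliseconds and look at
# quantiles per expression instead of the mean/max columns.
mb <- microbenchmark(
    rds   = readRDS('file.rds'),
    rdata = load('file.Rdata'),
    times = 100)
tapply(mb$time * 1e-6, mb$expr, quantile, probs = c(0, .5, .9, 1))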
Thanks for any help or comments.
Kind regards
Raphael
------------------------------------------------------------------------------------
Raphael Felber, PhD
Scientific Officer, Climate & Air Pollution
Federal Department of Economic Affairs,
Education and Research EAER
Agroscope
Research Division, Agroecology and Environment
Reckenholzstrasse 191, CH-8046 Zürich
Phone +41 58 468 75 11
Fax +41 58 468 72 01
raphael.felber at agroscope.admin.ch
www.agroscope.ch