[R] How to benchmark speed of load/readRDS correctly
William Dunlap
wdunlap at tibco.com
Tue Aug 22 18:26:20 CEST 2017
The large value for maximum time may be due to garbage collection, which
happens periodically. E.g., try the following, where the
unlist(as.list()) creates a lot of garbage. I get a very large time every
102 or 51 iterations and a moderately large time more often:
mb <- microbenchmark::microbenchmark(
    { x <- as.list(sin(1:5e5)); x <- unlist(x) / cos(1:5e5); sum(x) },
    times = 1000)
plot(mb$time)
quantile(mb$time * 1e-6, c(0, .5, .75, .90, .95, .99, 1))
#        0%       50%       75%       90%       95%       99%      100%
#  59.04446  82.15453 102.17522 180.36986 187.52667 233.42062 249.33970
diff(which(mb$time > quantile(mb$time, .99)))
# [1] 102  51 102 102 102 102 102 102  51
diff(which(mb$time > quantile(mb$time, .95)))
# [1]  6 41  4 47  4 40  7  4 47  4 33 14  4 47  4 47  4 47  4 47  4 47  4  6 41
#[26]  4  6  7  9 25  4 47  4 47  4 47  4 22 25  4 33 14  4  6 41  4 47  4 22
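
One way to check how much of that spread comes from collection (a sketch; it
assumes a recent microbenchmark with a `setup` argument, and uses a smaller
vector so it runs quickly) is to force a full gc() before each iteration and
compare the quantiles:

```r
library(microbenchmark)

# Same garbage-heavy expression as above, scaled down for speed.
# setup = gc(FALSE) runs a full collection before every iteration,
# so each run starts from a comparable heap state; with the GC debt
# paid up front, the upper quantiles should tighten noticeably.
mb2 <- microbenchmark(
  { x <- as.list(sin(1:1e5)); x <- unlist(x) / cos(1:1e5); sum(x) },
  times = 100,
  setup = gc(FALSE)
)
quantile(mb2$time * 1e-6, c(0, .5, .95, 1))  # milliseconds
```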
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Tue, Aug 22, 2017 at 5:53 AM, <raphael.felber at agroscope.admin.ch> wrote:
> Dear all
>
> I was thinking about efficiently reading data into R and tried several ways
> to test whether load('file.Rdata') or readRDS('file.rds') is faster. The
> files file.Rdata and file.rds contain the same data; the first was created
> with save(d, file='file.Rdata', compress=FALSE) and the second with
> saveRDS(d, 'file.rds', compress=FALSE).
>
> First I used the function microbenchmark() and was astonished by the
> max value of the output.
>
> FIRST TEST:
> > library(microbenchmark)
> > microbenchmark(
> + n <- readRDS('file.rds'),
> + load('file.Rdata')
> + )
> Unit: milliseconds
>               expr      min       lq     mean   median       uq       max neval
>  n <- readRDS(fl1) 106.5956 109.6457 237.3844 117.8956 141.9921 10934.162   100
>          load(fl2) 295.0654 301.8162 335.6266 308.3757 319.6965  1915.706   100
>
> It looks like the max value is an outlier.
>
> So I tried:
> SECOND TEST:
> > sapply(1:10, function(x) system.time(n <- readRDS('file.rds'))[3])
> elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
>   10.50    0.11    0.11    0.11    0.10    0.11    0.11    0.11    0.12    0.12
> > sapply(1:10, function(x) system.time(load('file.Rdata'))[3])
> elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
>    1.86    0.29    0.31    0.30    0.30    0.31    0.30    0.29    0.31    0.30
>
> Which confirmed my suspicion: the first time, loading the data takes much
> longer than the following times. I suspect this has something to do with
> how the data is assigned and that R doesn't have to 'fully' read the data
> when it is read a second time.
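
The first-read slowdown is more likely the operating system's file cache than
anything R does with the data: the first read pulls the file from disk, later
reads are served from memory. A self-contained sketch (the file is a tempfile
standing in for file.rds; note that a file written moments earlier may already
be cached, so the effect is clearest on files from an earlier session):

```r
d <- data.frame(x = rnorm(1e5), y = runif(1e5))
fl <- tempfile(fileext = ".rds")
saveRDS(d, fl, compress = FALSE)

# Read the file repeatedly; the first timing is the "cold" read,
# the rest are "warm" reads likely served from the OS page cache.
times <- vapply(1:11, function(i) system.time(readRDS(fl))[["elapsed"]],
                numeric(1))
c(cold = times[1], warm_median = median(times[-1]))
unlink(fl)
```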
>
> So the question remains: how can I make a realistic benchmark test? From
> the first test I would conclude that reading the *.rds file is faster. But
> this holds only for a large number of evaluations (neval). If I set
> times = 1, then reading the *.Rdata file would be faster (as the second
> test also indicates).
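
One pragmatic approach (a sketch, with hypothetical tempfiles standing in for
file.rds and file.Rdata): run many iterations and compare medians or trimmed
quantiles rather than the mean or max, which are dominated by GC and cache
outliers:

```r
library(microbenchmark)

d <- data.frame(x = rnorm(1e5))
fl_rds <- tempfile(fileext = ".rds");   saveRDS(d, fl_rds, compress = FALSE)
fl_rda <- tempfile(fileext = ".Rdata"); save(d, file = fl_rda, compress = FALSE)

mb <- microbenchmark(
  rds   = n <- readRDS(fl_rds),
  Rdata = load(fl_rda),
  times = 100
)
# Median nanoseconds -> milliseconds per expression; the median is robust
# to the occasional GC pause or cold read that inflates mean and max.
tapply(mb$time, mb$expr, median) * 1e-6
```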
>
> Thanks for any help or comments.
>
> Kind regards
>
> Raphael
> ------------------------------------------------------------
> ------------------------
> Raphael Felber, PhD
> Scientific Officer, Climate & Air Pollution
>
> Federal Department of Economic Affairs,
> Education and Research EAER
> Agroscope
> Research Division, Agroecology and Environment
>
> Reckenholzstrasse 191, CH-8046 Zürich
> Phone +41 58 468 75 11
> Fax +41 58 468 72 01
> raphael.felber at agroscope.admin.ch
> www.agroscope.ch
>
>
>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>