[Rd] Request to speed up save()
Simon Urbanek
simon.urbanek at r-project.org
Thu Jan 15 20:08:58 CET 2015
In addition to the major points that others have made: if you care about speed, don't use compression. With today's fast disks, using compression is an order of magnitude slower:
> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds.gz"))
user system elapsed
17.210 0.148 17.397
> system.time(saveRDS(d, file="test.rds", compress=F))
user system elapsed
0.482 0.355 0.929
The above example is intentionally highly compressible; in real life the differences are even bigger. As people who deal with big data know well, disks are no longer the bottleneck - the CPU is.
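If you do want compressed files on disk - which is essentially the pbzip2 question below - one option that needs no change to R is to write the uncompressed serialization through a pipe() connection into an external parallel compressor. A rough, untested sketch, assuming pbzip2 is installed and on the PATH (file names are just examples):

con <- pipe("pbzip2 -c > test.rds.bz2", "wb")   # parallel bzip2 on all cores
saveRDS(d, file = con)                          # R itself writes uncompressed to the pipe
close(con)

con <- pipe("pbzip2 -dc test.rds.bz2", "rb")    # decompress back through a pipe
d2 <- readRDS(con)
close(con)

Since pbzip2's output is ordinary bzip2, readRDS("test.rds.bz2") on the file name should also work; the same trick applies to pigz for gzip.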
Cheers,
Simon
BTW: why in the world would you use ascii=TRUE? It's pretty much the slowest possible serialization you can use - it even overshadows the cost of compression:
> system.time(saveRDS(d, file="test.rds", compress=F))
user system elapsed
0.459 0.383 0.940
> system.time(saveRDS(d, file="test-a.rds", compress=F, ascii=T))
user system elapsed
36.713 0.140 36.929
and the same goes for reading:
> system.time(readRDS("test-a.rds"))
user system elapsed
27.616 0.275 27.948
> system.time(readRDS("test.rds"))
user system elapsed
0.609 0.184 0.795
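The same applies to save()/save.image(), which take the same ascii and compress arguments, so for the use case quoted below the cheapest immediate fix is probably just (hypothetical file name):

save.image(file = "test.RData", ascii = FALSE, compress = FALSE)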
> On Jan 15, 2015, at 7:45 AM, Stewart Morris <Stewart.Morris at igmm.ed.ac.uk> wrote:
>
> Hi,
>
> I am dealing with very large datasets and it takes a long time to save a workspace image.
>
> The options to save compressed data are: "gzip", "bzip2" or "xz", the default being gzip. I wonder if it's possible to include the pbzip2 (http://compression.ca/pbzip2/) algorithm as an option when saving.
>
> "PBZIP2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. The output of this version is fully compatible with bzip2 v1.0.2 or newer"
>
> I tested this as follows with one of my smaller datasets, having only read in the raw data:
>
> ============
> # Dumped an ascii image
> save.image(file='test', ascii=TRUE)
>
> # At the shell prompt:
> ls -l test
> -rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test
>
> time bzip2 -9 test
> 364.702u 3.148s 6:14.01 98.3% 0+0k 48+1273976io 1pf+0w
>
> time pbzip2 -9 test
> 422.080u 18.708s 0:11.49 3836.2% 0+0k 0+1274176io 0pf+0w
> ============
>
> As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took 11 seconds, admittedly on a 64-core machine (running at 50% load). Most modern machines are multicore, so everyone would get some speedup.
>
> Is this feasible/practical? I am not a developer, so I'm afraid this would be down to someone else...
>
> Thoughts?
>
> Cheers,
>
> Stewart
>
> --
> Stewart W. Morris
> Centre for Genomic and Experimental Medicine
> The University of Edinburgh
> United Kingdom
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>