Stewart Morris Stewart.Morris at igmm.ed.ac.uk
Thu Jan 15 13:45:50 CET 2015


I am dealing with very large datasets and it takes a long time to save a 
workspace image.

The options for saving compressed data are "gzip", "bzip2" or "xz", the 
default being gzip. I wonder whether pbzip2 (http://compression.ca/pbzip2/), 
a parallel implementation of bzip2, could be added as an option when saving.

"PBZIP2 is a parallel implementation of the bzip2 block-sorting file 
compressor that uses pthreads and achieves near-linear speedup on SMP 
machines. The output of this version is fully compatible with bzip2 
v1.0.2 or newer"

I tested this as follows with one of my smaller datasets, having only 
read in the raw data:

# Dumped an ascii image
save.image(file='test', ascii=TRUE)

# At the shell prompt:
ls -l test
-rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test

time bzip2 -9 test
364.702u 3.148s 6:14.01 98.3%	0+0k 48+1273976io 1pf+0w

time pbzip2 -9 test
422.080u 18.708s 0:11.49 3836.2%	0+0k 0+1274176io 0pf+0w

As you can see, bzip2 on its own took over 6 minutes (6:14 elapsed) whereas 
pbzip2 took about 11.5 seconds, roughly 32 times faster, admittedly on a 
64-core machine (running at 50% load). Most modern machines are multicore, 
so most users would see some speedup.
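
For anyone wanting to check the compatibility claim quoted above on their 
own data, a rough shell sketch follows. It assumes bzip2, pbzip2 and cmp are 
on the PATH; the function name "roundtrip" is made up for illustration. It 
writes the compressed copy with -c, so the input file is left in place.

```shell
#!/bin/sh
# Sketch: compress a file with the given tool, then confirm that plain
# bzip2 can decompress the result back to the original bytes.
# "roundtrip" is a hypothetical helper, not part of bzip2 or pbzip2.

roundtrip() {
    tool=$1                 # compressor command, e.g. bzip2 or pbzip2
    src=$2                  # file to compress (left untouched)
    out="$src.$tool.bz2"    # compressed copy, named after the tool

    # -9: highest compression level; -c: write to stdout, keep the source
    "$tool" -9 -c "$src" > "$out" || return 1

    # Decompress with plain bzip2 and compare byte-for-byte with the source
    bzip2 -dc "$out" | cmp -s - "$src"
}
```

Something like "time roundtrip bzip2 test" followed by "time roundtrip 
pbzip2 test" would then give a comparison along the lines of the one above, 
although the timings would also include the decompression check.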

Is this feasible/practical? I am not a developer so I'm afraid this 
would be down to someone else...




