[R-sig-hpc] Antw: Big Data packages

Ashwin Kapur ashwin.kapur at gmail.com
Thu Mar 18 17:21:42 CET 2010

I've been looking at big data packages on CRAN, though I haven't looked at RForge much yet.  So far I've played with bigmemory and ff.  Both are great packages, though ff seemed closer to what would work for me: I have code that uses large three-dimensional arrays, and bigmemory doesn't handle more than two dimensions.  Performance for file-backed matrices seems to be quite a bit slower than for in-memory ones, though admittedly I've only been playing with the packages for a short time, so that probably has more to do with my inexperience than with the packages themselves.

From what I can see, for file-backed matrices neither ff nor bigmemory takes advantage of the fact that spending some CPU to reduce disk access makes things MUCH faster, because disk access is a few orders of magnitude slower than anything else.  So compressing the data is a huge win, simply because you do much less disk I/O.  Compression is sometimes considered a problem because the better compression algorithms need to compress the whole matrix and decompress the whole thing as well, so for huge matrices you can end up swapping while compressing and decompressing.  The better scientific data handling libraries get around this by asking you to specify a chunk size: for a 500x10000x400 array you might specify a chunk size of 1x10000x400, and the library compresses chunks of data of that size and indexes them.  You generally choose chunks so that the typical amount of data you handle at one time is one chunk.  Other optimizations are generally done as well.
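To make the chunking idea concrete, here is a minimal sketch in Python (not R, just for illustration), with made-up, scaled-down dimensions.  Each "slab" along the first axis is compressed and indexed independently, so reading one slab never requires decompressing the whole array:

```python
import struct
import zlib

# Hypothetical dimensions, scaled down from the 500x10000x400 example.
# A chunk is one slab along the first axis: 1 x ncol x nlayer values.
nrow, ncol, nlayer = 50, 100, 40
chunk_values = ncol * nlayer

# Fake dataset; integer-valued doubles have simple binary
# representations and compress very well.
data = [float(i) for i in range(nrow * chunk_values)]

# Compress each slab independently and keep an index of the results.
chunks = []
for r in range(nrow):
    raw = struct.pack(f"<{chunk_values}d",
                      *data[r * chunk_values:(r + 1) * chunk_values])
    chunks.append(zlib.compress(raw))

def read_slab(r):
    """Decompress and unpack only slab r, not the whole array."""
    raw = zlib.decompress(chunks[r])
    return list(struct.unpack(f"<{chunk_values}d", raw))

assert read_slab(3)[0] == data[3 * chunk_values]

raw_bytes = 8 * nrow * chunk_values
compressed_bytes = sum(len(c) for c in chunks)
assert compressed_bytes < raw_bytes
```

The point of the sketch is the access pattern: any single read or write touches one compressed chunk, so peak memory is bounded by the chunk size you chose, not by the array size.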

One other important optimization is transposing the binary data.  Typical scientific datasets are made up of numbers relatively close to each other, so if you consider how they are represented on the machine, you could easily have a dataset in which the higher-order bits are identical across values.  Since most compression algorithms are some variant of run-length encoding, there is an obvious optimization.  If you try to compress the data as-is, the compressor can't do all that much.  However, if you essentially transpose the binary vectors, so that all the highest-order bits come first, then the next ones, and so on, then if, say, all the highest-order bits are the same, they can be stored in essentially one size_t, and so on down the bit positions.  Again this is memory intensive, but chunking, compressing, and saving chunks separately gets around that problem.  So instead of the typical 3x-4x, compression goes to about 10x, and for some datasets much higher, which translates into much higher read and write speed.
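A byte-level version of this transpose (the "shuffle" idea) is easy to demonstrate.  The sketch below, again in Python purely for illustration, regroups the bytes of a run of neighbouring 8-byte integers so that all the first bytes come together, then all the second bytes, and so on, and compares compressed sizes:

```python
import struct
import zlib

# Neighbouring values: the high-order bytes are identical everywhere.
n = 4096
values = [1_000_000 + i for i in range(n)]
raw = struct.pack(f"<{n}q", *values)  # 8 bytes per value, little-endian

# "Shuffle": transpose the n x 8 byte matrix, so byte 0 of every value
# comes first, then byte 1 of every value, and so on.  The constant
# high-order bytes now form long runs that run-length-style coders love.
shuffled = bytes(raw[v * 8 + b] for b in range(8) for v in range(n))

plain = len(zlib.compress(raw))
transposed = len(zlib.compress(shuffled))
print(plain, transposed)
assert transposed < plain
```

This is essentially what the HDF5 shuffle filter does before handing a chunk to the compressor; the transform is free to undo exactly, and only pays off because it is applied per chunk.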

Of course the "obvious" optimizations like sparse storage for sparse matrices etc are important too.
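For completeness, the sparse idea in its simplest form: store only the nonzero entries, so storage scales with the number of nonzeros rather than with the matrix dimensions.  A minimal dictionary-of-keys sketch (Python, illustrative names only):

```python
# Dictionary-of-keys sparse storage: only nonzero entries are kept.
class SparseMatrix:
    def __init__(self, nrow, ncol):
        self.shape = (nrow, ncol)
        self.data = {}  # (row, col) -> value

    def __setitem__(self, key, value):
        if value != 0.0:
            self.data[key] = value
        else:
            self.data.pop(key, None)  # storing zero frees the entry

    def __getitem__(self, key):
        return self.data.get(key, 0.0)  # absent entries read as zero

m = SparseMatrix(10_000, 10_000)
m[3, 7] = 2.5
assert m[3, 7] == 2.5 and m[0, 0] == 0.0
assert len(m.data) == 1  # grows with nonzeros, not with nrow * ncol
```

Real sparse formats (CSR/CSC and the like) are more elaborate, but the storage argument is the same.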

Other languages I use for HPC generally have interfaces to the various data handling libraries, notably hdf5 and netcdf, so you can use multiple languages to interact with the same datasets.  To me a package that wraps hdf5 is an obvious need, especially now that it seems clear that hdf5 is probably the "best" (tm) such library; even the netcdf team agrees, having made the latest version of netcdf just a wrapper over hdf5.  From what I can tell there are a couple of R packages that can read hdf5, but as far as I can tell neither actually provides an interface that is useful for HPC applications, just a way of taking an hdf5 file and putting the whole thing into memory, which is impossible if it's a truly large file.
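The distinction being drawn here is between loading a whole file and reading just the slice you need.  The latter is what hdf5's chunked datasets give you; the sketch below shows the same access pattern on a plain uncompressed binary file (Python, hypothetical row-major layout), where a seek lets you read one row without touching the rest:

```python
import os
import struct
import tempfile

# Write a flat binary "matrix" to disk: row-major, 8-byte doubles.
nrow, ncol = 1000, 500
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
with open(path, "wb") as f:
    for r in range(nrow):
        f.write(struct.pack(f"<{ncol}d",
                            *(float(r * ncol + c) for c in range(ncol))))

def read_row(path, r, ncol):
    """Seek to row r and read only its ncol * 8 bytes."""
    with open(path, "rb") as f:
        f.seek(r * ncol * 8)
        return struct.unpack(f"<{ncol}d", f.read(ncol * 8))

row = read_row(path, 42, ncol)
assert row[0] == 42 * ncol  # only one row's bytes were read
```

An HPC-usable hdf5 wrapper would offer the same thing through hyperslab selections, with chunked compression handled underneath, rather than materializing the whole dataset in memory.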

I am aware of sqlite etc., but those really aren't designed for file-backed data storage for HPC applications; they are rather more general libraries.  I'm wondering whether people are aware of existing packages that connect R to hdf5 and the like.  I don't want to start on a package-writing adventure if it's already been done.  And of course I'm aware that creating a package of this sort, one whose data types can be used as simple drop-in replacements for standard R matrices or arrays, is, let's just say, likely to be complicated without rewriting R.

Thanks for the pointers to Metakit, RAM and SciDB.  I'll play with them for a bit.
