[R-sig-hpc] Antw: Big Data packages

Thu Mar 18 18:07:22 CET 2010

The 'indexing' package on R-forge is very much in alpha, but it does
some things quite well now.

The definition of 'big data' is obviously user-driven.  10GB probably
requires very different thinking than 10TB.

I designed 'indexing' and 'mmap' for the former, on up to (hopefully)
the TB range.  The limitation at present is simply that they rely on a
single system to run; there is no communication channel yet.

In a nutshell, mmap is a high-level wrapper (R) to a low-level sys
call (mmap) that is at the heart of every database. It currently supports
on-the-fly conversion from machine types: char; 1-, 2-, 3- and 4-byte
signed and unsigned ints; 32- and 64-bit floats; complex; and raw.
There is even support for C-style structs, to effectively implement
'row' storage (keeping data on one page in memory).  This is most
similar to ff or bigmemory in design goal, but quite a bit different
in implementation.  The on-disk structure is nothing but a byte
string.  No metadata, etc.  The idea is for ultimate portability
within a system, since the filebacked objects can be shared.
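
To make the "nothing but a byte string" point concrete, here is a
minimal sketch in base R.  It does not use the mmap package or its API;
the temp file and the helper read_int32_slice are made up for
illustration.  It writes 4-byte little-endian ints with no header, then
reads back a slice by seeking to a byte offset, which is roughly the
access pattern a memory-mapped file gives you without the explicit reads.

```r
# Base R only; NOT the mmap package API. Names here are hypothetical.
path <- tempfile(fileext = ".bin")

# Write 1..1000 as raw 4-byte little-endian signed ints: no header, no metadata.
writeBin(1:1000, path, size = 4L, endian = "little")

# Read elements [from, to] (1-based) without loading the whole file.
read_int32_slice <- function(path, from, to) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Element k starts at byte offset (k - 1) * 4.
  seek(con, where = (from - 1L) * 4L, origin = "start")
  readBin(con, what = "integer", n = to - from + 1L, size = 4L, endian = "little")
}

slice <- read_int32_slice(path, 501L, 505L)
file_bytes <- file.size(path)   # 4 bytes per element, nothing else in the file
```

Because the file carries no metadata, the reader has to know the element
type, size and byte order up front, which is the trade-off for
portability between processes on the same system.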

Caveats to mmap at present (March 2010) are no Windows support and it
is little-endian only.  But those are reasonable shortcomings given the
prevalence of x86 and *nix in HPC and 'real' computing. ;-)

The 'indexing' package provides simple semantics to do exactly that,
index and search data. It uses mmap-ed objects (though multiple
backends are in the works, including just memory) and
can index any vector ... be it stand-alone or part of another data
object (data.frame, matrix, xts, zoo, ...)

The design of indexing was to make subsetting (a requirement of really
big data) as R-like as possible:

db <- loadIndex("symbols")
loadIndex("delta", double())
loadIndex("bid", int32(), extractFUN=function(x) x/100)
loadIndex("ask", int32(), extractFUN=function(x) x/100)

db[symbols=="AAPL", data.frame(symbols, delta, spread=ask-bid)]

works

So does:

lapply(c("AAPL","CSCO"), function(x) db[symbols==x,
data.frame(symbols, delta)[delta > 0.5,]])

All of the above use the index (if available) and are very fast.  On a
70 million obs data set of end-of-day option prices (19 columns; 3GB of
data + 3GB of indexes), the AAPL example requires no memory to speak of
and takes 0.018s to return 90k rows.  All on a laptop with a little
5400 rpm drive.  Indexing supports RLE if appropriate, and will
eventually do more along the lines of compression.
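
For readers wondering how an RLE index can answer symbols=="AAPL"
without scanning the column, here is a rough sketch in base R.  It is
not the 'indexing' package internals, which are not described here; the
toy data and the lookup helper are assumptions for illustration only.
The idea is to sort once by key, run-length encode the sorted keys, and
turn each lookup into a contiguous range.

```r
# Base R sketch of a run-length-encoded index; names and data are hypothetical.
symbols <- c("CSCO", "AAPL", "MSFT", "AAPL", "CSCO", "AAPL")
delta   <- c(0.10, 0.70, 0.30, 0.55, 0.90, 0.20)

# Sort once by key; 'ord' maps sorted positions back to original rows.
ord  <- order(symbols)
runs <- rle(symbols[ord])          # one (value, length) pair per distinct key

# Start/end of each key's run within the sorted order.
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1L

# Look up all rows for one key: find its run, then map back via 'ord'.
lookup <- function(key) {
  i <- match(key, runs$values)
  if (is.na(i)) return(integer(0))
  sort(ord[starts[i]:ends[i]])
}

aapl_rows <- lookup("AAPL")
# Further filtering touches only the matched rows, not the whole column.
aapl_big  <- delta[aapl_rows][delta[aapl_rows] > 0.5]
```

With few distinct keys and many rows per key, the run table stays tiny
relative to the data, which is when RLE is "appropriate".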

Naïve timings against MonetDB, MySQL etc are very good. (read: much better).

Again, serious 'work-in-progress', but it is being used by me daily to
get at data that is otherwise impossible to use in R.  Feedback and
testing from others are very welcome...

RBerkeley (also my code) is quite useful as well, and has most of the
functionality that comes with the DB software.  70+ functions of the
API.  BerkeleyDB is a nice key-value store that is very fast (despite
what the new kids in the NoSQL crowd say), and has a long history of
deployment and use.

HTH
Jeff

On Thu, Mar 18, 2010 at 11:21 AM, Ashwin Kapur <ashwin.kapur at gmail.com> wrote:
> I've been looking at big data packages on CRAN though I haven't looked at RForge much yet.  So far I've played with bigmemory and ff.  Both are great packages, though ff seemed closer to what would work for me.  I have code that uses large multi (3) dimensional arrays and bigmemory doesn't do more than 2 dimensions.  Performance for filebacked matrices seems to be quite a bit slower, though admittedly I've only been playing with the packages for a short time so that probably has more to do with it than the packages themselves.
>
> From what I can see, for file backed matrices neither ff nor bigmemory seemed to take advantage of the fact that using some CPU to make disk access faster actually makes things MUCH faster because disk access is a few orders of magnitude slower than anything else. So basically compressing data is a huge win because you do much less disk access.  Compression is sometimes considered an issue because for the better compression algorithms you need to compress the whole matrix and uncompress the whole thing as well, so for huge matrices you can end up swapping while compressing and decompressing. The better scientific data handling libraries get around this by asking you to specify a chunk size, so say for a 500x10000x400 array you specify a chunksize of 1x10000x400 and the library compresses chunks of data of that size and indexes them.  You generally choose chunks such that the typical amount of data you handle at one time is one chunk.  Other optimizations are also generally done.
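
The chunking idea above can be sketched in a few lines of base R.  This
is only an illustration under assumed names (chunks, read_chunk) and a
small 5x100x40 array, not how any particular library does it: each
1x100x40 slab is compressed on its own and the list position serves as
the index, so reading one slab never decompresses the others.

```r
# Base R sketch of chunked compression; layout and helper names are hypothetical.
set.seed(1)
x <- array(round(rnorm(5 * 100 * 40), 1), dim = c(5, 100, 40))

# Compress each 1 x 100 x 40 slab independently; the list position is the index.
chunks <- lapply(seq_len(dim(x)[1]), function(i) {
  memCompress(writeBin(as.vector(x[i, , ]), raw()), type = "gzip")
})

# Read one slab: only that chunk is decompressed.
read_chunk <- function(i) {
  bytes <- memDecompress(chunks[[i]], type = "gzip")
  matrix(readBin(bytes, what = "double", n = 100 * 40), nrow = 100, ncol = 40)
}

slab3 <- read_chunk(3)
```

In a file-backed version each compressed chunk would be written at a
recorded offset instead of being kept in a list.
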
>
> One other important optimization is transposing the binary data.  Typical scientific datasets are made up of numbers relatively close to each other.  If you consider how they are represented on the machine, you could easily have a dataset for which the higher order bits are identical.  Now when you consider that most compression algorithms are some variant of run length encoding there is an obvious optimization.  If you try to compress the data as is the compressor can't do all that much.  However if you essentially transpose the binary vectors, so you have all the highest order bits first, then the next ones etc if say all the highest order bits are the same, they can be stored in essentially one size_t and so on.  Again this is memory intensive, but chunking and compressing and saving chunks separately gets around this problem.  So instead of the typical 3x - 4x compression, typical compression goes to about 10x and for some datasets much higher which translates into much higher read and write speed.
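
A byte-level version of the transposition described above is easy to try
in base R.  Note the swap: this sketch shuffles whole bytes rather than
individual bits, a coarser variant of the same idea, and the test data
(1000 consecutive integers) is an assumption chosen so the high-order
bytes repeat.  It groups byte 0 of every value together, then byte 1,
and so on, before handing the result to gzip.

```r
# Base R sketch of a byte shuffle before compression (byte-level, not bit-level).
x <- 1000L + 0:999                  # values close to each other
bytes <- writeBin(x, raw(), size = 4L, endian = "little")

# Each column of 'm' holds the 4 bytes of one value.
m <- matrix(bytes, nrow = 4L)
# Transpose so all first bytes come first, then all second bytes, etc.
shuffled <- as.vector(t(m))

plain_z <- memCompress(bytes, type = "gzip")
shuf_z  <- memCompress(shuffled, type = "gzip")

# Undo the shuffle after decompression to recover the original values.
restored <- as.vector(t(matrix(memDecompress(shuf_z, type = "gzip"), ncol = 4L)))
x_back   <- readBin(restored, what = "integer", n = length(x),
                    size = 4L, endian = "little")

sizes <- c(plain = length(plain_z), shuffled = length(shuf_z))
```

Comparing the two entries of sizes shows the effect on this toy input;
real gains depend on the data and on the compressor.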
>
> Of course the "obvious" optimizations like sparse storage for sparse matrices etc are important too.
>
> Other languages I use for HPC generally have interfaces to the various data handling libraries, notably hdf5 and netcdf, so you can use multiple languages to interact with the same datasets.  To me a package that wraps hdf5 is an obvious need, especially now that it seems clear that hdf5 is probably the "best" (tm) such library and even the netcdf team agrees, having made the latest version of netcdf just a wrapper over hdf5.  From what I can tell there are a couple of R packages that can read hdf5, but neither actually provides an interface that is useful for HPC applications, just a way of taking a hdf5 file and putting the whole thing into memory, which is impossible if it's a truly large file.
>
> I am aware of sqlite etc but they really aren't designed for doing file backed data storage for HPC applications.  They are rather more general libraries.  I'm wondering if there are packages that connect R to hdf5 etc that people are aware of.  I don't want to start on a package writing adventure if it's already been done.  And of course I'm aware that creating a package of this sort that creates data types that can be used as simple drop in replacements for standard R matrices or arrays is, let's just say, likely to be complicated without rewriting R.
>
> Thanks for the pointers to Metakit, RAM and SciDB.  I'll play with them for a bit.
>
> --Ashwin

-- 
Jeffrey Ryan
jeffrey.ryan at insightalgo.com

ia: insight algorithmics
www.insightalgo.com