[R-sig-hpc] Big Data packages

Andrew Piskorski atp at piskorski.com
Thu Mar 18 16:57:40 CET 2010


On Wed, Mar 17, 2010 at 04:26:16PM -0400, Ashwin Kapur wrote:
> Just wondering if anyone has opinions on the various big data packages for
> R, ff vs bigmemory vs anything else.  Is anyone working on or is there

I don't really know.  However, since both ff and bigmemory are
intended for use with giant larger-than-RAM matrices via memory-mapped
files on disk, back c. October 2009 I briefly tried out both in order
to answer one question:

Is either package a straightforward drop-in replacement for EXISTING
code manipulating large R matrices, in order to reduce R's massive
(and probably quite inefficient) memory use in such cases?

The short answer is no, they're not.  Neither one even really attempts
to work transparently as a matrix in R.  Both packages have major
quirks and special behaviors which in practice seem to mean that you
must write your code specifically for them.  These range from smaller
issues, like is.na() or apply() not working, to conceptually bigger
ones, like pass-by-reference semantics rather than the pass-by-value
R uses everywhere else.
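
To illustrate the pass-by-reference point, here's a minimal sketch
using bigmemory's big.matrix (the helper name modify_first is mine;
the big.matrix behavior is as I understood it at the time):

```r
library(bigmemory)

## A big.matrix is essentially an external pointer to (possibly
## file-backed) data, so it follows reference semantics, unlike a
## normal R matrix.
modify_first <- function(m) m[1, 1] <- 99

x <- matrix(0, 2, 2)             # ordinary R matrix: pass-by-value
modify_first(x)
x[1, 1]                          # still 0; the function changed a copy

b <- big.matrix(2, 2, init = 0)  # big.matrix: pass-by-reference
modify_first(b)
b[1, 1]                          # now 99; the shared data was changed
```

Code written for ordinary matrices generally assumes the copy-on-assign
behavior above, which is why neither package is a drop-in replacement.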

And if you're writing special-case code, then other tools, like
RSQLite or perhaps even Metakit, also become options.  Note that I
have no particular opinion on how useful ff or bigmemory are in
general; I didn't even attempt to figure that out.
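
For instance, a chunked-access approach with RSQLite might look like
the sketch below (the table and column names are made up for
illustration; the DBI calls are the standard ones):

```r
library(DBI)
library(RSQLite)

## Keep the data on disk in SQLite and pull it back in chunks,
## rather than holding it all in RAM at once.
con <- dbConnect(RSQLite::SQLite(), "bigdata.db")  # file name arbitrary
dbWriteTable(con, "obs",
             data.frame(id = 1:1e6, value = rnorm(1e6)))

## Process one chunk at a time via LIMIT/OFFSET.
chunk <- dbGetQuery(con, "SELECT * FROM obs LIMIT 10000 OFFSET 0")
mean(chunk$value)

dbDisconnect(con)
```

This is special-case code too, of course, but no more so than what ff
or bigmemory appear to require.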

And finally, some other out-there technologies to keep an eye on for
potential use in massive data manipulation in R (but unlike the
packages above, these probably are not usable with R right now):

- If completed, Jean-Claude Wippler's Vlerq might well have been very
  useful for R, perhaps even as a unification of and upgrade to R's
  native matrix, array, and data frame data structures.  Unfortunately
  that project is dead.  In some ways it also sounded similar to what
  Kdb/Q do.

- MonetDB is interesting, but may be too server-like for embedded use
  from R.

- Alex van Ballegooij's "RAM" Relational Array Mapping extension for
  MonetDB sounds potentially relevant for R-like use of matrices, but
  it's not clear whether it actually worked for anything other than
  his PhD thesis.
  http://www.cwi.nl/en/2009/1026/New-array-database-technology-for-scientists

- If SciDB gets anywhere, it might end up useful as an out-of-core
  multi-dimensional matrix back-end for R, even though it is intended
  more as an RDBMS server than as a lightweight library.

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/


