[R-sig-hpc] Big Data packages

Brian G. Peterson brian at braverock.com
Thu Mar 18 17:10:48 CET 2010


I think if you are looking for a matrix-like replacement, you should
probably look at the 'indexing' package by Jeff Ryan (author of xts,
quantmod, and others).  It is very 'R-like' in its usage and subsetting,
holding the 'index' in memory.  It turns out to be faster than bigmemory
for most types of access.

  - Brian

Andrew Piskorski wrote:
> On Wed, Mar 17, 2010 at 04:26:16PM -0400, Ashwin Kapur wrote:
>   
>> Just wondering if anyone has opinions on the various big data packages for
>> R, ff vs bigmemory vs anything else.  Is anyone working on or is there
>>     
>
> I don't really know.  However, since both ff and bigmemory are
> intended for use with giant larger-than-RAM matrices via memory-mapped
> files on disk, around October 2009 I briefly tried out both in order
> to answer one question:
>
> Is either package a straightforward drop-in replacement for EXISTING
> code manipulating large R matrices, in order to reduce R's massive
> (and probably quite inefficient) memory use in such cases?
>
> The short answer is no, they're not.  Neither one even really attempts
> to work transparently as a matrix in R.  Both packages have major
> quirks and special behaviors which in practice seem to mean that you
> must write your code specifically for them.  These range from smaller
> things like is.na() or apply() not working to conceptually bigger
> ones like pass-by-reference semantics rather than the pass-by-value
> semantics R uses almost everywhere else.
>
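
To make the pass-by-reference point above concrete, here is a minimal
sketch in base R only.  It does not call ff or bigmemory; an environment
is used as a stand-in (an assumption for illustration) for an object with
reference semantics, and the names zero_first, zero_first_ref, and box
are hypothetical.

```r
# Ordinary R matrices have copy semantics: modifying an argument
# inside a function changes only the function's local copy.
zero_first <- function(m) {
  m[1, 1] <- 0
  invisible(NULL)
}

x <- matrix(1:4, nrow = 2)
zero_first(x)
value_result <- x[1, 1]     # caller's matrix is unchanged

# Environments are reference objects: the function receives a handle
# to the same object, so the change is visible to the caller.
zero_first_ref <- function(e) {
  e$m[1, 1] <- 0
  invisible(NULL)
}

box <- new.env()
box$m <- matrix(1:4, nrow = 2)
zero_first_ref(box)
ref_result <- box$m[1, 1]   # caller's matrix has been modified
```

Existing code that relies on the first behavior (functions never altering
their inputs) can silently change meaning when a reference-backed object
is substituted for a plain matrix, which is one reason a drop-in swap is
hard.
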
> And if you're writing special-case code, then other tools, like
> RSQLite or perhaps even Metakit, also become options.  Note that I
> have no particular opinion on how useful ff or bigmemory are in
> general; I didn't even attempt to figure that out.
>
> And finally, some other out-there technologies to keep an eye on for
> potential use in massive data manipulation in R (but unlike the
> packages above, these probably are not usable with R right now):
>
> - If completed, Jean-Claude Wippler's Vlerq might well have been very
>   useful for R, perhaps even as a unification of and upgrade to R's
>   native matrix, array, and data frame data structures.  Unfortunately
>   that project is dead.  It also sounded in some ways like what Kdb/Q do.
>
> - MonetDB is interesting, but may be too server-like for embedded use
>   from R.
>
> - Alex van Ballegooij's "RAM" Relational Array Mapping extension for
>   MonetDB sounds potentially relevant for R-like use of matrices, but
>   it's not clear whether it actually worked for anything other than
>   his PhD thesis.
>   http://www.cwi.nl/en/2009/1026/New-array-database-technology-for-scientists
>
> - If SciDB gets anywhere, it might end up useful as an out-of-core
>   multi-dimensional matrix back-end for R, even though it is intended
>   more as an RDBMS server than as a lightweight library.
>
>   


-- 
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock


