[Rd] Rapid Random Access
Barry Rowlingson
b.rowlingson at lancaster.ac.uk
Fri Dec 14 19:01:30 CET 2007
I have some code that can potentially produce a huge number of
large-ish R data frames, each with a different number of rows. Together
the data frames will be far too big to keep in R's memory, but we'll
assume any single one is manageable. It's when there are a million of
them that the machine might start to burn up.
However, I might want, for example, to compute some averages over the
elements in the data frames, or sample ten of them at random and make
some plots. What I need is rapid random access to data stored in
external files.
Here are some ideas I've had:
* Store all the data in an HDF5 file. The problem here is that the
current HDF5 package for R reads the whole file in at once.
* Store the data in some other custom binary format with an index for
rapid access to the N-th element (sketched after this list). Problems:
it feels like reinventing HDF5, cross-platform issues, etc.
* Store the data in a number of .RData files in a directory, so that
getting the N-th element is just
attach(paste("foo/A-", n, ".RData", sep="")), give or take a parameter
or two (also sketched below).
* Use a database. It seems a bit heavyweight, but RSQLite might work to
keep it local (sketched below as well).
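
Here's a minimal sketch of the custom-binary-format idea, assuming all
the frames are available once to build the file: write each frame with
serialize() and keep a vector of byte offsets, so fetching the N-th
frame is a seek() plus an unserialize(). The file name and the toy
frames are made up, and ?seek warns that seeking on Windows connections
is discouraged, which is part of the cross-platform worry above.

frames <- lapply(1:5, function(i) data.frame(x = rnorm(10 * i)))

con <- file("frames.bin", "wb")
offsets <- numeric(length(frames))
for (i in seq_along(frames)) {
    offsets[i] <- seek(con)            # byte offset of the i-th frame
    serialize(frames[[i]], con)
}
close(con)

con <- file("frames.bin", "rb")
getFrame <- function(n) {              # random access via the offset index
    seek(con, offsets[n])
    unserialize(con)
}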
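
And the one-file-per-frame idea; here I've used saveRDS()/readRDS()
rather than attach() on .RData files, so the fetch returns the object
directly instead of touching the search path. The "foo" directory and
"A-" prefix are from the example above; the zero-padding is just so the
files sort nicely.

dir.create("foo", showWarnings = FALSE)

framePath <- function(n) file.path("foo", sprintf("A-%06d.rds", n))

storeFrame <- function(df, n) saveRDS(df, framePath(n))  # one file per frame

getFrame <- function(n) readRDS(framePath(n))            # N-th frame on demand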
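
Finally, a sketch of the RSQLite route, assuming the frames need not
share a schema: each frame goes in as a serialized BLOB keyed by its
index, so getting one back is a single indexed lookup. The table and
file names are made up. (If the frames all had the same columns, then
dbWriteTable() with an id column would work too, and SQL could do the
averaging itself.)

library(RSQLite)

con <- dbConnect(SQLite(), "frames.db")
dbExecute(con, "CREATE TABLE IF NOT EXISTS frames
                (id INTEGER PRIMARY KEY, data BLOB)")

storeFrame <- function(df, n)           # one row per frame, keyed by n
    dbExecute(con, "INSERT INTO frames VALUES (?, ?)",
              params = list(n, list(serialize(df, NULL))))

getFrame <- function(n) {               # indexed lookup, then unserialize
    bytes <- dbGetQuery(con, "SELECT data FROM frames WHERE id = ?",
                        params = list(n))$data[[1]]
    unserialize(bytes)
}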
What I'm currently doing is keeping the code OO enough that I can, in
theory, implement all of the above. At the moment I have an
implementation that keeps them all in R's memory as a list of data
frames, which is fine for small test cases, but things are going to get
big shortly. Any other ideas or hints are welcome.
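
The abstraction really only needs a getFrame(n), like the ones in the
sketches above, and then the use cases from the top come out the same
whichever backend is underneath. For instance (the column "x" and the
frame count are hypothetical):

nFrames <- 5
ids <- sample(nFrames, 3)               # a few frames at random
means <- sapply(ids, function(i) mean(getFrame(i)$x))

## or a running grand mean over all frames, one in memory at a time
tot <- cnt <- 0
for (i in seq_len(nFrames)) {
    v <- getFrame(i)$x
    tot <- tot + sum(v)
    cnt <- cnt + length(v)
}
tot / cnt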
thanks
Barry