[R-sig-hpc] lazy load (and unload?) elements of large list
Gregory Jefferis
jefferis at gmail.com
Mon Nov 19 08:57:35 CET 2012
Dear Saptarshi,
Thanks for your response.
On 17 Nov 2012, at 05:42, Saptarshi Guha wrote:
> One question: how large is each list element? < 256MB?
Fairly small (< 5 MB).
> One approach: store your data in Hbase (or Hadoop MapFiles), with key
> as the list index.
Can HBase cope with arbitrarily formatted data (like an R list) rather than just tables?
>
> Then define an object O of class C. Redefine "[[".C as a function that
> queries HBase/HDFS for the list index i (e.g. as in x[[i]]) and
> retrieves the i'th list element.
Yes, that was the sort of thing I had in mind.
> Cache this, so the second time x[[i]] is
> called, it will retrieve it from the cache. To prevent the cache from
> expanding to 2GB, you can keep the last K cache entries (some MRU/LRU-type
> cache-retention scheme).
I was hoping someone might have done some work on this type of strategy that could be reused!
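For what it's worth, here is a minimal sketch of that strategy in plain R, using per-element .rds files on disk in place of HBase/HDFS, with an on-demand `[[` method and a small LRU cache. All names (lazylist, the backing-file layout, cache_size) are illustrative, not from any existing package:

```r
## Lazy list sketch: elements live on disk as individual .rds files;
## `[[` fetches them on demand and keeps at most `cache_size` recently
## used elements in memory (LRU eviction).
lazylist <- function(dir, cache_size = 5L) {
  obj <- new.env(parent = emptyenv())
  obj$dir <- dir
  obj$cache <- list()          # named list: "<i>" -> element, oldest first
  obj$cache_size <- cache_size
  class(obj) <- "lazylist"
  obj
}

`[[.lazylist` <- function(x, i) {
  key <- as.character(i)
  if (key %in% names(x$cache)) {
    ## cache hit: move the entry to the end (most recently used)
    val <- x$cache[[key]]
    x$cache[[key]] <- NULL
    x$cache[[key]] <- val
    return(val)
  }
  ## cache miss: read element i from its backing file
  val <- readRDS(file.path(x$dir, paste0(key, ".rds")))
  x$cache[[key]] <- val
  ## evict the least recently used entry beyond the cache limit
  if (length(x$cache) > x$cache_size)
    x$cache[[1]] <- NULL
  val
}

## Example: write 10 small elements to disk, then access them lazily
d <- tempfile("lazylist")
dir.create(d)
for (i in 1:10) saveRDS(i * 100, file.path(d, paste0(i, ".rds")))
x <- lazylist(d, cache_size = 3L)
x[[4]]          # read from disk
x[[4]]          # now served from the cache
```

The same `[[` method could query HBase instead of calling readRDS; the caching logic would be unchanged.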
> Not sure how you intend to use this list, the application above
> handles the query of some subset of keys (indices 1...16K).
> Do you want to run some function F across all/large subset of keys?
> This is a good case for R and Hadoop.
I have two use cases. The first is batch processing on a cluster, where all elements will be processed, typically by a function that takes a pair of elements as input and computes scores for all 16k^2 combinations; this is much faster when run in parallel. For the time being simple approaches seem to work here, but Hadoop is something I have been meaning to investigate for the longer term. The second main use case, interactive use on a single-user machine, where I am more concerned about saturating memory, is what I am trying to address currently.
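The batch use case above could be sketched with the parallel package that ships with R; score_fn and the toy data here are placeholders for the real pairwise scoring function and the real 16k-element list (note mc.cores > 1 requires a Unix-alike):

```r
## All-pairs scoring sketch: compute score_fn for every (i, j) pair
## of list elements in parallel with mcmapply.
library(parallel)

score_fn <- function(a, b) sum(a * b)        # stand-in for the real score

x <- lapply(1:4, function(i) rep(i, 10))     # toy list; the real one has 16k elements

## enumerate all ordered index pairs, score them in parallel,
## and reshape into a length(x) x length(x) score matrix
pairs <- expand.grid(i = seq_along(x), j = seq_along(x))
scores <- mcmapply(function(i, j) score_fn(x[[i]], x[[j]]),
                   pairs$i, pairs$j, mc.cores = 2L)
score_matrix <- matrix(scores, nrow = length(x))
```

Combined with a lazy `[[`, only the pair of elements currently being scored needs to be resident in memory on each worker.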
Best wishes,
Greg.