[R-sig-hpc] lazy load (and unload?) elements of large list

Gregory Jefferis jefferis at gmail.com
Mon Nov 19 08:57:35 CET 2012


Dear Saptarshi,

Thanks for your response.

On 17 Nov 2012, at 05:42, Saptarshi Guha wrote:

> One question: how large is each list element? < 256MB?

Fairly small (<5 MB).

> One approach: store your data in Hbase (or Hadoop MapFiles), with key
> as the list index.

Can HBase cope with data in arbitrary formats (like a list) rather than just tables?
> 
> Then define an object O of class C. Redefine "[[".C as a function that
> queries HBase/HDFS for the list index i (e.g. as in x[[i]]) and
> retrieves the i'th list element.

Yes, that was the sort of thing I had in mind.
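
For concreteness, a minimal sketch of that idea, with each element
written out as its own .rds file as a stand-in for the HBase/HDFS
backend (the class name and file layout here are just hypothetical):

    ## constructor: records where the elements live and how many there are
    lazylist <- function(dir, n) {
      structure(list(dir = dir, n = n), class = "lazylist")
    }

    ## "[[" method: read the i'th element from the backend only on demand
    "[[.lazylist" <- function(x, i) {
      readRDS(file.path(x$dir, sprintf("element_%05d.rds", i)))
    }

    length.lazylist <- function(x) x$n

So x <- lazylist("/path/to/store", 16000) holds nothing in memory, and
x[[42]] reads just that one element from disk.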

> Cache this, so the second time x[[i]] is called, it will retrieve it
> from the cache. To prevent the cache from expanding to 2 GB, you can
> keep the last K cache entries (some MRU/LRU-type cache retention
> scheme).

I was hoping someone might have done some work on this type of strategy that could be reused!
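
For what it's worth, one way that might look: a small cache held in an
environment with a keep-the-last-K (LRU-style) eviction rule. This is
just a sketch; fetch() stands in for whatever backend read (readRDS, an
HBase query, ...) actually retrieves element i.

    make_cached_fetch <- function(fetch, K = 100L) {
      cache <- new.env(parent = emptyenv())
      keys <- character(0)                 # cache keys, oldest first
      function(i) {
        key <- as.character(i)
        if (exists(key, envir = cache, inherits = FALSE)) {
          ## hit: move the key to the most-recently-used end
          keys <<- c(setdiff(keys, key), key)
          return(get(key, envir = cache))
        }
        val <- fetch(i)
        if (length(keys) >= K) {           # full: evict least recently used
          rm(list = keys[1], envir = cache)
          keys <<- keys[-1]
        }
        assign(key, val, envir = cache)
        keys <<- c(keys, key)
        val
      }
    }

Wrapping the backend read, e.g. get_el <- make_cached_fetch(function(i)
x[[i]], K = 200), then keeps repeat accesses fast while bounding memory
at roughly K elements.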

> Not sure how you intend to use this list; the application above
> handles querying some subset of keys (indices 1...16K).
> Do you want to run some function F across all, or a large subset, of keys?
> This is a good case for R and Hadoop.

I have two use cases. The first is batch processing on a cluster where all elements will be processed, typically by a function that takes a pair of elements as inputs and computes scores for all 16k^2 combinations. For the time being simple approaches seem to work there, but Hadoop is something I have been meaning to investigate for the longer term. The second use case, interactive use on a single-user machine, is where I am more concerned about saturating memory, and is what I am currently trying to address.
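
For the batch case the pattern I have in mind is roughly the sketch
below, using the parallel package (score() and elements are
hypothetical stand-ins for the real scoring function and list, and
mclapply forks, so this is for Unix-alikes only):

    library(parallel)

    ## score every pair (i, j); parallelise over the first index so each
    ## worker computes one full row of the n x n score matrix
    score_all_pairs <- function(elements, score, ncores = detectCores()) {
      n <- length(elements)
      rows <- mclapply(seq_len(n), function(i) {
        vapply(seq_len(n),
               function(j) score(elements[[i]], elements[[j]]),
               numeric(1))
      }, mc.cores = ncores)
      do.call(rbind, rows)
    }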

Best wishes,

Greg.

