[R-sig-hpc] lazy load (and unload?) elements of large list

Gregory Jefferis jefferis at gmail.com
Fri Nov 16 21:55:10 CET 2012


Hello,

Short version:
==========

Can anyone recommend a method to load the elements of a large (multi-gigabyte) list on demand, where each list element is itself a complex list (an S3 object)?

With many thanks,

Greg Jefferis.

Long version:
==========
I often use a data structure which is basically a list of lists. The sublists are in fact 3D reconstructions of fly neurons [1]. For one of my active projects the whole list is ~2 GB, divided into 16,000 sublists (i.e. neurons). I am getting to the stage where I need to figure out how to page this kind of data from disk when necessary, both for interactive data exploration and for batch processing.

I know of ways to handle large file-backed matrices, e.g. the big.matrix/ff packages. I could potentially make a special backing object that aggregated most of my list data into a huge file-backed matrix, cache that, and then write intermediate code to fetch the right rows of the big matrix (a rough sketch of this idea follows the list below). I know that big.matrix file-backed data seem to be cached pretty efficiently by the OS in the face of multiple reads. However, this sounds quite fiddly and I would be keen to find a solution that:

1) handles lists natively (i.e. not just matrices)
2) offers the choice to read from disk or to lazily load data into memory
3) can unload individual sublists (and reload them when necessary) to keep memory usage reasonable (below some threshold?)
4) is fast (within the limits imposed by the OS etc.)
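
To make the big-matrix idea concrete, here is a minimal (untested) sketch, assuming purely for illustration that each sublist carries its main payload as a numeric $points matrix of xyz coordinates and that the full list is available in memory once as neuronlist; the row index would of course need to be saved alongside the descriptor file.

library(bigmemory)

## number of 3D points per neuron and in total (assumes a $points matrix)
nr <- sapply(neuronlist, function(n) nrow(n$points))
npoints <- sum(nr)

## one big file-backed matrix holding every point of every neuron
bm <- filebacked.big.matrix(nrow = npoints, ncol = 3, type = "double",
                            backingfile = "points.bin",
                            descriptorfile = "points.desc")

## index mapping each neuron to its block of rows
## (idx itself would need saving, e.g. with saveRDS)
idx <- data.frame(name  = names(neuronlist),
                  start = cumsum(nr) - nr + 1,
                  end   = cumsum(nr))

for (i in seq_along(neuronlist))
  bm[idx$start[i]:idx$end[i], ] <- neuronlist[[i]]$points

## in a later session: attach the backing file and fetch one neuron's points
bm <- attach.big.matrix("points.desc")
getpoints <- function(name) {
  r <- idx[idx$name == name, ]
  bm[r$start:r$end, ]    # only these rows are read from disk
}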

I have looked at the filehash package but do not see a way to lazy load list elements directly, though I could put all the elements into an environment after mangling their names and then overload e.g. "[[" to fetch them.
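
For the record, this is roughly what I meant (a minimal, untested sketch; the lazyneuronlist wrapper class is just something I invented for illustration, and neuronlist is again the full in-memory list):

library(filehash)

dbCreate("neurondb")                      # one-off
db <- dbInit("neurondb")
for (nm in names(neuronlist))
  dbInsert(db, nm, neuronlist[[nm]])      # each neuron stored under its own key

## thin S3 wrapper so that [[ fetches single neurons from disk on demand
lazyneuronlist <- structure(list(db = db), class = "lazyneuronlist")

"[[.lazyneuronlist" <- function(x, i) dbFetch(x$db, i)
length.lazyneuronlist <- function(x) length(dbList(x$db))

n1 <- lazyneuronlist[[ dbList(db)[1] ]]   # read from disk only at this point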

I have also looked at lazyLoad in base R, but that only works with individual objects (so the same issue as above) and has no option to read from disk only.
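
A base-R variant of the same idea (again only a sketch): write each sublist to its own .rds file once, then populate an environment with promises via delayedAssign so that readRDS only runs the first time an element is touched. Names are assumed to be valid file names.

outdir <- "neurons"
dir.create(outdir, showWarnings = FALSE)
for (nm in names(neuronlist))
  saveRDS(neuronlist[[nm]], file.path(outdir, paste0(nm, ".rds")))

## environment of promises: each element is loaded from disk on first access
lazyenv <- new.env()
for (f in list.files(outdir, pattern = "\\.rds$", full.names = TRUE)) {
  local({
    file <- f
    delayedAssign(sub("\\.rds$", "", basename(file)), readRDS(file),
                  eval.env = environment(), assign.env = lazyenv)
  })
}

## thin S3 wrapper to keep the usual list syntax
lazylist <- structure(list(env = lazyenv), class = "lazylist")
"[[.lazylist" <- function(x, i) get(i, envir = x$env)

n1 <- lazylist[[ ls(lazyenv)[1] ]]   # readRDS runs here; the result then stays cached

Once a promise has been forced the object stays in memory, so "unloading" (point 3 above) would presumably mean re-running delayedAssign for that name and waiting for garbage collection.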

Could anyone advise strategies/packages to take a look at? There's so much stuff on and off CRAN these days that I feel there must be something relevant that I have missed.

[1] An example rgl/webgl export, for the curious:

http://flybrain.mrc-lmb.cam.ac.uk/vfb/fc/clusters/FruMARCM-M002099_seg001/

This set contains 31 of the 16,000 neurons; the raw data occupy 4335320 bytes in memory.

--
Gregory Jefferis, PhD                      
Division of Neurobiology                   
MRC Laboratory of Molecular Biology,       
Hills Road,                                
Cambridge, CB2 0QH, UK.                    

http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis
http://www.neuroscience.cam.ac.uk/directory/profile.php?gsxej2
http://flybrain.stanford.edu


