[R] FW: Large datasets in R

François Pinard pinard at iro.umontreal.ca
Wed Jul 19 00:56:26 CEST 2006

[Thomas Lumley]

>People have used R in this way, storing data in a database and reading it 
>as required. There are also some efforts to provide facilities to support 
>this sort of programming (such as the current project funded by Google 
>Summer of Code:  http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html). 

Interesting project indeed!  However, if R needs more swapping 
because arrays do not all fit in physical memory, crudely replacing 
swapping with database accesses is not necessarily going to buy
a drastic speed improvement: the paging gets done in user space instead 
of being done in the kernel.
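
To make the distinction concrete, here is a minimal sketch in Python (not R, and not the Summer of Code project's design).  It sums the same file of doubles twice: once through a memory map, where the kernel pages data in as it is touched, and once through explicit chunked reads, where the program does its own "paging" in user space.  A plain file stands in for the database, and all function names and the chunk size are made up for illustration.

```python
import array
import mmap
import os
import struct
import tempfile

ITEM_SIZE = struct.calcsize("d")  # bytes per native double


def write_doubles(path, values):
    """Store the values as raw native doubles (hypothetical on-disk format)."""
    with open(path, "wb") as f:
        array.array("d", values).tofile(f)


def sum_with_kernel_paging(path):
    """Map the file and let the kernel bring pages in as they are touched."""
    total = 0.0
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for offset in range(0, len(m), ITEM_SIZE):
            (value,) = struct.unpack_from("d", m, offset)
            total += value
    return total


def sum_with_user_space_paging(path, chunk_items=1024):
    """Fetch one chunk at a time; the program decides what is resident."""
    total = 0.0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_items * ITEM_SIZE)
            if not data:
                break
            chunk = array.array("d")
            chunk.frombytes(data)
            total += sum(chunk)
    return total


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_doubles(path, [float(i) for i in range(5000)])
        print(sum_with_kernel_paging(path))
        print(sum_with_user_space_paging(path))
    finally:
        os.remove(path)
```

Both functions do the same arithmetic; only the party responsible for moving data between disk and memory changes, which is why swapping one for the other does not by itself promise a large speed-up.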

Long ago, while I was working on CDC mainframes (astonishing at the 
time but tiny by today's standards), there was a program able to invert 
very big matrices or run simplex computations on them.  I do not 
remember the name of the program, and only ever studied it 
superficially (I was in computer support for researchers, not a 
researcher myself).  The program was 
documented as being extremely careful at organising accesses to rows and 
columns (or parts thereof) in such a way that real memory was best used.
In other words, at the core of this program was a paging system very 
specialised and cooperative with the problems meant to be solved.
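
As a rough illustration of what "organising accesses" can mean, the following Python sketch multiplies two matrices tile by tile, so that only a few small blocks need to be resident at any moment.  It is not the CDC program's method, which is unknown here; the function name and the default block size are assumptions chosen for the example.

```python
def blocked_matmul(a, b, block=2):
    """Multiply a (n x m) by b (m x p), visiting the operands tile by tile.

    Illustrative only: each (i0, k0, j0) triple touches one block of a,
    one block of b and one block of the result, so the working set stays
    at about three blocks regardless of the full matrix size.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Work confined to a single tile of each matrix.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The result is the same as with the textbook triple loop; what changes is the order in which memory is visited, and that order is where such a program would spend most of its design effort.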

However, the source of this program was just plain huge (let's say from 
memory, about three or four times the size of the optimizing FORTRAN 
compiler, which I already knew better as an impressive algorithmic 
undertaking).  So, right or wrong, the prejudice stuck solidly in me at 
the time, if nothing else, that handling big arrays the right way, 
speed-wise, ought to be very difficult.

>One reason there isn't more of this is that relying on Moore's Law has 
>worked very well over the years.

On the other hand, the computational needs for scientific problems grow 
fairly quickly to the size of our ability to solve them.  Let me take
weather forecasting for example.  3-D geographical grids are never fine 
enough for the resolution meteorologists would like to get, and the time 
required for each prediction step grows very rapidly, for only a modest 
gain in precision.  By merely tuning a few parameters, these 
people may easily pump nearly all the available cycles out of the 
supercomputers given to them, and they do so without hesitation.  
Moore's Law will never succeed at calming their starving hunger! :-).
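
A back-of-the-envelope sketch of that growth, in Python.  It rests on two simplifying assumptions that are not from any particular forecasting model: the cost of a run is proportional to the number of grid cells times the number of time steps, and the time step has to shrink in proportion to the grid spacing, as is typical of explicit schemes.  The function name and its parameters are made up for the example.

```python
def relative_cost(refinement, dims=3, time_step_scales=True):
    """Cost of a run relative to the unrefined grid (illustrative model).

    refinement: factor by which the grid spacing is divided along each axis.
    dims: number of spatial dimensions of the grid.
    time_step_scales: assume the time step shrinks with the grid spacing.
    """
    cells = refinement ** dims
    steps = refinement if time_step_scales else 1
    return cells * steps


if __name__ == "__main__":
    for r in (1, 2, 4, 10):
        print(r, relative_cost(r))
```

Under these assumptions, halving the grid spacing costs 16 times as much computation, so a single doubling of resolution absorbs several hardware doublings.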

François Pinard   http://pinard.progiciels-bpi.ca
