[R] FW: Large datasets in R
François Pinard
pinard at iro.umontreal.ca
Wed Jul 19 00:56:26 CEST 2006
[Thomas Lumley]
>People have used R in this way, storing data in a database and reading it
>as required. There are also some efforts to provide facilities to support
>this sort of programming (such as the current project funded by Google
>Summer of Code: http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html).
Interesting project indeed! However, if R needs more swapping
because arrays do not all fit in physical memory, crudely replacing
swapping with database accesses is not necessarily going to buy
a drastic speed improvement: the paging gets done in user space instead
of being done in the kernel.
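To make "paging in user space" concrete, here is a minimal sketch in
Python using the standard sqlite3 module. The table `obs`, the column
`x`, and the helper `chunked_mean` are hypothetical names chosen for
illustration only; they do not come from R or from the Summer of Code
project mentioned above. The program, not the kernel, decides how many
rows are resident at a time.

```python
import sqlite3


def chunked_mean(conn, table, column, chunk_size=1000):
    """Mean of one column, holding at most chunk_size rows in memory."""
    # Assumption: table and column are trusted identifiers, not user input.
    cur = conn.execute(f"SELECT {column} FROM {table}")
    total, count = 0.0, 0
    while True:
        rows = cur.fetchmany(chunk_size)  # one "page", fetched explicitly
        if not rows:
            break
        total += sum(row[0] for row in rows)
        count += len(rows)
    return total / count if count else float("nan")


# Hypothetical demo data: ten observations in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL)")
conn.executemany("INSERT INTO obs VALUES (?)",
                 [(float(i),) for i in range(10)])
print(chunked_mean(conn, "obs", "x", chunk_size=3))
```

Each fetchmany call plays the role of a page fault serviced by the
application, so whatever it costs is paid in user code rather than
hidden in the kernel.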
Long ago, while I was working on CDC mainframes, astonishing at the time
but tiny by today's standards, there was a program able to invert very
big matrices or run the simplex method on them. I do not remember the
name of the program, and never studied it more than superficially (I was
in computer support for researchers, not a researcher myself). The program was
documented as being extremely careful at organising accesses to rows and
columns (or parts thereof) in such a way that real memory was best used.
In other words, at the core of this program was a paging system very
specialised and cooperative with the problems meant to be solved.
However, the source of this program was just plain huge (let's say from
memory, about three or four times the size of the optimizing FORTRAN
compiler, which I already knew better as an impressive algorithmic
undertaking). So, right or wrong, the prejudice stuck solidly in me at
the time, if nothing else, that handling big arrays the right way,
speed-wise, ought to be very difficult.
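As a toy illustration of what organising accesses to rows and columns
can mean, here is a minimal sketch of a blocked (tiled) matrix product
in Python. It is not the CDC program, whose details are not known
here; the function name `blocked_matmul` and the `block` parameter are
assumptions made for the example.

```python
def blocked_matmul(a, b, block=2):
    """Multiply a (n x m) by b (m x p), visiting the operands tile by tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Inside these loops only three tiles are touched: one of
                # a, one of b, one of c. The working set is bounded by the
                # block size, whatever the size of the full matrices.
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, min(k0 + block, m)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(j0, min(j0 + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(blocked_matmul(a, b))
```

The arithmetic is the same as in the naive triple loop; only the order
of accesses changes. Even this toy needs six nested loops, which gives
some feeling for why a program doing it well for inversion and simplex
could grow huge.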
>One reason there isn't more of this is that relying on Moore's Law has
>worked very well over the years.
On the other hand, the computational needs for scientific problems grow
fairly quickly to the size of our ability to solve them. Let me take
weather forecasting for example. 3-D geographical grids are never fine
enough for the resolution meteorologists would like to get, and the time
required for each prediction step grows very rapidly for only a modest
gain in precision. By merely tuning a few parameters, these
people may easily pump nearly all the available cycles out of the
supercomputers given to them, and they do so without hesitation.
Moore's Law will never succeed at calming their starving hunger! :-).
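A back-of-the-envelope sketch of that growth, in Python. The function
`relative_cost` is hypothetical, and so is the assumption that the time
step must shrink in proportion to the grid spacing (commonly the case
for explicit schemes, but not something stated above). Refining a 3-D
grid by a factor r multiplies the number of cells, hence the work per
step, by r^3, and under that assumption the total work by r^4.

```python
def relative_cost(refinement, dims=3, scale_time_step=True):
    """Work after refining the grid, relative to the original grid."""
    cells = refinement ** dims  # work per prediction step
    # Assumption: the number of time steps grows with the refinement too.
    steps = refinement if scale_time_step else 1
    return cells * steps


for r in (2, 4, 10):
    print(r, relative_cost(r, scale_time_step=False), relative_cost(r))
```

Under these assumptions, merely doubling the resolution costs 8 times
more per step and 16 times more overall.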
--
François Pinard http://pinard.progiciels-bpi.ca