[R] FW: Large datasets in R
gunter.berton at gene.com
Wed Jul 19 01:19:43 CEST 2006
Or, more succinctly, "Pinard's Law":
The demands of ever more data always exceed the capabilities of ever better hardware.
-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of François Pinard
> Sent: Tuesday, July 18, 2006 3:56 PM
> To: Thomas Lumley
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] FW: Large datasets in R
> [Thomas Lumley]
> >People have used R in this way, storing data in a database and reading it
> >as required. There are also some efforts to provide facilities to support
> >this sort of programming (such as the current project funded by Google
> >Summer of Code:
> Interesting project indeed! However, if R needs more memory than the
> machine has because arrays do not all fit in physical memory, crudely
> replacing swapping with database accesses is not necessarily going to buy
> a drastic speed improvement: the paging gets done in user space instead
> of being done in the kernel.
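
A minimal sketch of the kind of chunked database access Lumley describes,
using the DBI and RSQLite packages as they exist today (the file, table and
column names below are invented), keeps only a small window of rows in
memory at any moment:

    library(DBI)

    ## Open an on-disk SQLite database; only the rows currently fetched live in RAM.
    con <- dbConnect(RSQLite::SQLite(), "measurements.sqlite")
    res <- dbSendQuery(con, "SELECT x FROM readings")

    total <- 0; n <- 0
    while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 10000)   # pull 10,000 rows at a time
      total <- total + sum(chunk$x)
      n     <- n + nrow(chunk)
    }
    dbClearResult(res)
    dbDisconnect(con)

    total / n   # mean of x without ever loading the whole table
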
> Long ago, while working on CDC mainframes (astonishing at the time, but
> tiny by today's standards), there was a program able to invert, or run
> simplexes on, very big matrices. I do not remember the name of the
> program, and never studied it more than superficially (I was in computer
> support for researchers, but not a researcher myself). The program was
> documented as being extremely careful about organising accesses to rows
> and columns (or parts thereof) in such a way that real memory was best
> used. In other words, at the core of this program was a paging system,
> very specialised and cooperative with the problems it was meant to solve.
> However, the source of this program was just plain huge (let's say, from
> memory, about three or four times the size of the optimizing FORTRAN
> compiler, which I already knew to be an impressive algorithmic
> undertaking). So, rightly or wrongly, the prejudice stuck solidly in me
> at the time that handling big arrays the right way, speed-wise, ought to
> be very difficult.
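
The basic idea can be sketched in a few lines of R (a toy illustration, not
the CDC program): accumulate a cross-product X'X one block of rows at a
time, so that only one block ever has to sit in memory.

    ## Block-wise crossproduct: X'X accumulated chunk by chunk.
    blockwise_crossprod <- function(read_block, n_blocks, p) {
      acc <- matrix(0, p, p)
      for (b in seq_len(n_blocks)) {
        X   <- read_block(b)          # user-supplied: returns one (rows x p) block
        acc <- acc + crossprod(X)     # t(X) %*% X for this block
      }
      acc
    }

    ## Toy check, with the "file" faked as a list of row blocks:
    set.seed(1)
    blocks <- replicate(4, matrix(rnorm(1000 * 5), 1000, 5), simplify = FALSE)
    xtx <- blockwise_crossprod(function(b) blocks[[b]], n_blocks = 4, p = 5)
    all.equal(xtx, crossprod(do.call(rbind, blocks)))   # TRUE
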
> >One reason there isn't more of this is that relying on Moore's Law has
> >worked very well over the years.
> On the other hand, the computational needs of scientific problems grow
> fairly quickly to the limit of our ability to meet them. Take weather
> forecasting, for example: 3-D geographical grids are never fine enough
> for the resolution meteorologists would like to get, and the time
> required for each prediction step grows very rapidly for only a modest
> gain in precision. By merely tuning a few parameters, these people can
> easily pump nearly all the available cycles out of the supercomputers
> given to them, and they do so without hesitation. Moore's Law will never
> succeed in calming their ravenous hunger! :-)
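
A back-of-the-envelope illustration of that scaling (my numbers, assuming a
CFL-style constraint that the time step must shrink along with the grid
spacing): halving the spacing of a 3-D grid multiplies the number of cells
by 2^3 = 8, and halving the time step on top of that makes each forecast
roughly 16 times as expensive, all for a factor-of-two gain in resolution.

    refine <- 2                 # halve the grid spacing in every direction
    cells  <- refine^3          # 8 times as many grid cells
    steps  <- refine            # time step halved as well (CFL-type constraint)
    cells * steps               # ~16 times the work per forecast
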
> François Pinard http://pinard.progiciels-bpi.ca
> R-help at stat.math.ethz.ch mailing list
> PLEASE do read the posting guide!
> and provide commented, minimal, self-contained, reproducible code.