[R] FW: Large datasets in R

Wed Jul 19 01:19:43 CEST 2006

Or, more succinctly, "Pinard's Law":

The demands of ever more data always exceed the capabilities of ever better
hardware.

;-D

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of François Pinard
> Sent: Tuesday, July 18, 2006 3:56 PM
> To: Thomas Lumley
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] FW: Large datasets in R
> 
> [Thomas Lumley]
> 
> >People have used R in this way, storing data in a database and reading it
> >as required. There are also some efforts to provide facilities to support
> >this sort of programming (such as the current project funded by Google
> >Summer of Code:
> >http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html).
> 
> Interesting project indeed!  However, if R requires more swapping
> because arrays do not all fit in physical memory, crudely replacing
> swapping with database accesses is not necessarily going to buy
> a drastic speed improvement: the paging gets done in user space instead
> of being done in the kernel.
> 
> Long ago, while I was working on CDC mainframes (astonishing at the time
> but tiny by today's standards), there was a program able to invert very
> big matrices or solve simplex problems on them.  I do not remember the
> name of the program, and never studied it more than superficially (I was
> in computer support for researchers, but not a researcher myself).  The
> program was documented as being extremely careful at organising accesses
> to rows and columns (or parts thereof) in such a way that real memory was
> best used.  In other words, at the core of this program was a paging
> system very specialised and cooperative with the problems meant to be
> solved.
> 
> However, the source of this program was just plain huge (let's say, from
> memory, about three or four times the size of the optimizing FORTRAN
> compiler, which I already knew better as an impressive algorithmic
> undertaking).  So, right or wrong, the prejudice stuck solidly in me at
> the time, if nothing else, that handling big arrays the right way,
> speed-wise, ought to be very difficult.
> 
> >One reason there isn't more of this is that relying on Moore's Law has
> >worked very well over the years.
> 
> On the other hand, the computational needs for scientific problems grow
> fairly quickly to the size of our ability to solve them.  Let me take
> weather forecasting for example.  3-D geographical grids are never fine
> enough for the resolution meteorologists would like to get, and the time
> required for each prediction step grows very rapidly for only a modest
> gain in precision.  By merely tuning a few parameters, these people may
> easily pump nearly all the available cycles out of the supercomputers
> given to them, and they do so without hesitation.  Moore's Law will
> never succeed at calming their starving hunger! :-).
> 
> -- 
> François Pinard   http://pinard.progiciels-bpi.ca
> 