[R] FW: Large datasets in R

Tue Jul 18 17:34:02 CEST 2006

On Tue, 18 Jul 2006, Ritwik Sinha wrote:

> Hi,
>
> I have a related question. How differently do other statistical
> softwares handle large data?
>
> The original post claims that 350 MB is fine on Stata. Someone
> suggested S-Plus. I have heard people say that SAS can handle large
> data sets. Why can others do it and R seems to have a problem? Don't
> these softwares load the data onto RAM?
>

Stata does load the data into RAM and does have limits for the same reason 
that R does. However, Stata has a less flexible representation of its data 
(basically one rectangular dataset) and so it can handle somewhat larger 
data sets for any given memory size. For example, even with 512Mb of 
memory a 350Mb data set might be usable in Stata and with 1Gb it would 
certainly be. Stata is also faster for a given memory load, apparently 
because of its simpler language design [some evidence for this is that the 
recent language additions to support flexible graphics run rather more 
slowly than, e.g., lattice in R].

The other approach is to write the estimation routines so that only part 
of the data need be in memory at a given time.  *Some* procedures in SAS 
and SPSS work this way, and this is the idea of the S-PLUS 7.0 system for 
handling large data sets.   This approach requires the programmer to 
handle the reading of sections of data from disk, something that can 
only be automated to a limited extent.
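
As a rough illustration of the partial-data approach described above, here is a 
minimal sketch of a simple linear regression that keeps only running sums in 
memory while the data are read a few rows at a time. It is written in Python 
purely for illustration; it is not how SAS, SPSS or S-PLUS actually implement 
their procedures. The function names (read_chunks, chunked_ols), the column 
names "x" and "y", and the tiny in-memory stand-in for a large file are all 
assumptions made for the example.

```python
import csv
import io
from itertools import islice


def read_chunks(handle, chunk_size):
    """Yield lists of (x, y) float pairs, at most chunk_size rows each."""
    reader = csv.DictReader(handle)  # assumes columns named "x" and "y"
    while True:
        chunk = [(float(r["x"]), float(r["y"]))
                 for r in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


def chunked_ols(handle, chunk_size=1000):
    """Fit y = a + b*x while holding only one chunk and five sums in memory."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in read_chunks(handle, chunk_size):
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    # Closed-form least-squares estimates from the accumulated sums.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Small in-memory stand-in for a file too large to load at once.
demo = io.StringIO("x,y\n" + "\n".join(f"{i},{2 * i + 1}" for i in range(10)))
intercept, slope = chunked_ols(demo, chunk_size=3)
```

This works because least squares needs only a handful of sums over the data. 
Procedures that need several passes, or access to all rows at once, are much 
harder to restructure this way, which is part of why the approach can only be 
automated to a limited extent.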

People have used R in this way, storing data in a database and reading it 
as required. There are also some efforts to provide facilities to support 
this sort of programming (such as the current project funded by Google 
Summer of Code:  http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html). 
One reason there isn't more of this is that relying on Moore's Law has 
worked very well over the years.
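
The database pattern mentioned above might look roughly like the following 
sketch, which computes a column mean by fetching rows in batches rather than 
pulling the whole table into memory. It uses Python's built-in sqlite3 module 
as a stand-in for whatever database is actually in use; the table name "obs", 
the column name "value" and the helper column_mean are hypothetical, chosen 
only for the example.

```python
import sqlite3


def column_mean(conn, table, column, batch_size=1000):
    """Mean of one numeric column, reading batch_size rows at a time."""
    # Identifiers are interpolated directly, so they must come from
    # trusted code, never from user input.
    cur = conn.execute(f'SELECT "{column}" FROM "{table}"')
    total = 0.0
    count = 0
    while True:
        rows = cur.fetchmany(batch_size)  # only this batch is held in memory
        if not rows:
            break
        total += sum(r[0] for r in rows)
        count += len(rows)
    return total / count if count else float("nan")


# Tiny in-memory database standing in for a large on-disk one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (value REAL)")
conn.executemany("INSERT INTO obs VALUES (?)",
                 [(float(i),) for i in range(1, 101)])
mean_value = column_mean(conn, "obs", "value", batch_size=7)
```

For a summary this simple the database could compute the mean itself with an 
aggregate query; the batch loop is shown because it generalises to estimation 
routines the database cannot run.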

          -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle