[R] FW: Large datasets in R
tlumley at u.washington.edu
Tue Jul 18 17:34:02 CEST 2006
On Tue, 18 Jul 2006, Ritwik Sinha wrote:
> I have a related question. How differently do other statistical
> software packages handle large data?
> The original post claims that 350 MB is fine on Stata. Someone
> suggested S-Plus. I have heard people say that SAS can handle large
> data sets. Why can others do it while R seems to have a problem? Don't
> these packages load the data into RAM?
Stata does load the data into RAM and does have limits for the same reason
that R does. However, Stata has a less flexible representation of its data
(basically one rectangular dataset) and so it can handle somewhat larger
data sets for any given memory size. For example, even with 512Mb of
memory a 350Mb data set might be usable in Stata and with 1Gb it would
certainly be. Stata is also faster for a given memory load, apparently
because of its simpler language design [some evidence for this is that the
recent language additions to support flexible graphics run rather more
slowly than, e.g., lattice in R].
The other approach is to write the estimation routines so that only part
of the data need be in memory at a given time. *Some* procedures in SAS
and SPSS work this way, and this is the idea of the S-PLUS 7.0 system for
handling large data sets. This approach requires the programmer to
handle the reading of sections of data from disk, something that can
only be automated to a limited extent.
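
As a rough illustration of the "only part of the data in memory" idea, here
is a minimal sketch in Python (not R, and not how SAS, SPSS or S-PLUS
actually implement it). It fits a simple linear regression by accumulating
sums over chunks of rows, so memory use depends on the chunk size rather
than the file size. The two-column CSV layout and the names read_chunks and
streaming_ols are assumptions made up for this example.

```python
import csv
import io
from itertools import islice


def read_chunks(handle, chunk_size):
    """Yield lists of (x, y) float pairs, at most chunk_size rows at a time."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    while True:
        chunk = [(float(x), float(y)) for x, y in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


def streaming_ols(handle, chunk_size=1000):
    """Fit y = a + b*x from sums accumulated one chunk at a time."""
    n = sx = sy = sxx = sxy = 0.0
    for chunk in read_chunks(handle, chunk_size):
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data standing in for a file too large to load at once.
data = io.StringIO("x,y\n" + "".join(f"{i},{2 * i + 1}\n" for i in range(10)))
intercept, slope = streaming_ols(data, chunk_size=3)
```

This works because least squares needs only a handful of running sums. A
procedure that needs repeated passes or random access to the rows is much
harder to restructure this way, which is the limit on automation noted above.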
People have used R in this way, storing data in a database and reading it
as required. There are also some efforts to provide facilities to support
this sort of programming (such as the current project funded by Google
Summer of Code: http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html).
One reason there isn't more of this is that relying on Moore's Law has
worked very well over the years.
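
The database approach mentioned above can be sketched along the same lines.
This is again Python with only the standard library, not R, and the table
name obs, its columns, and the function group_means are hypothetical. The
data live in the database and are pulled across in batches, so only one
batch of rows is held in memory at a time.

```python
import sqlite3

# In-memory database used here as a stand-in for a large on-disk one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [("a" if i % 2 == 0 else "b", float(i)) for i in range(100)],
)


def group_means(conn, batch_size=25):
    """Compute per-group means while fetching one batch of rows at a time."""
    totals, counts = {}, {}
    cursor = conn.execute("SELECT grp, value FROM obs")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for grp, value in rows:
            totals[grp] = totals.get(grp, 0.0) + value
            counts[grp] = counts.get(grp, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


means = group_means(conn)
```

For a summary this simple one would normally let the database do the work
(SELECT grp, AVG(value) ... GROUP BY grp). The explicit loop is only meant
to show the read-as-required pattern for computations the database cannot
do itself.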
Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu      University of Washington, Seattle