[R] Can R handle medium and large size data sets?
Thomas Lumley
tlumley at u.washington.edu
Tue Jan 24 17:06:58 CET 2006
On Tue, 24 Jan 2006, Gueorgui Kolev wrote:
> Dear R experts,
>
> Is it true that R generally cannot handle medium-sized data sets (a
> couple of hundred thousand observations) and therefore large data
> sets (a couple of million observations)?
>
> I googled and found lots of questions regarding this issue, but
> curiously there were no straightforward answers as to what can be
> done to make R capable of handling such data.
Because it depends on the situation.
> My experience is rather limited---I tried to load a Stata data set of
> about 150000 observations (which Stata handles instantly) using the
> library "foreign". After half an hour R was still "thinking", so I
> stopped the attempts.
Like Stata, R prefers to store all the data in memory, but because of R's
flexibility it takes more memory than Stata does, and for simple analyses
it is slower. For simple analyses Stata probably needs only 10-20% as much
memory as R on a given data set.
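To get a feel for the in-memory cost, a minimal sketch (the sizes here are
illustrative, not measurements of Stata or of any particular data set):

```r
# 150,000 rows x 10 numeric columns = 1.5 million doubles at 8 bytes each,
# so the raw data alone is about 12 Mb before any copies R makes.
d <- as.data.frame(matrix(rnorm(150000 * 10), ncol = 10))
print(object.size(d), units = "Mb")  # rough in-memory footprint
```

object.size() only reports the object itself; intermediate copies made
during reading and analysis can multiply that several times over.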
If you have a 64-bit version of R it can handle quite large data sets,
certainly millions of records. On the other hand an ordinary PC might
well start to slow down noticeably with a few tens of thousands of
reasonably complex records.
Often it is not necessary to store all the data in memory at once, and
there are database interfaces to make this easier.
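A sketch of that approach, assuming the RSQLite package is installed (the
table name "obs" and the query are hypothetical):

```r
library(DBI)
library(RSQLite)

# In real use this would be a file-backed database holding the full data;
# ":memory:" keeps the example self-contained.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "obs", data.frame(id = 1:100000, x = rnorm(100000)))

# Pull only the subset needed for the analysis, not the whole table:
sub <- dbGetQuery(con, "SELECT id, x FROM obs WHERE x > 2")
dbDisconnect(con)
```

Only the rows matching the query ever reach R's memory, so the database
does the heavy filtering.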
R (and S before it) has generally assumed that increasing computer power
will solve many problems more easily than programming would, and this has
mostly turned out to be correct.
If you want Stata, you know where to find it (and it's a good choice for
many problems).
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle