[R] Can R handle medium and large size data sets?

Thomas Lumley tlumley at u.washington.edu
Tue Jan 24 17:06:58 CET 2006


On Tue, 24 Jan 2006, Gueorgui Kolev wrote:

> Dear R experts,
>
> Is it true that R generally cannot handle medium-sized data sets (a
> couple of hundred thousand observations) and therefore large
> data sets (a couple of million observations)?
>
> I googled and I found lots of questions regarding this issue, but
> curiously there were no straightforward answers what can be done to
> make R capable of handling data.

Because the answer depends on the situation.

> My experience is rather limited---I tried to load a Stata data set of
> about 150,000 observations (which Stata handles instantly) using the
> library "foreign". After half an hour R was still "thinking", so I
> stopped the attempt.

Like Stata, R prefers to store all the data in memory, but because of R's 
flexibility it takes more memory than Stata does, and for simple analyses 
is slower. For simple analyses Stata probably needs only 10-20% as much 
memory as R on a given data set.
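On the loading problem itself, read.dta() in the foreign package is often much faster with value-label conversion turned off. A minimal sketch, round-tripping a small file through a temporary path (the data here are made up for illustration):

```r
## Round-trip a small Stata file with the 'foreign' package.
library(foreign)

d <- data.frame(id = 1:1000, x = rnorm(1000))
f <- tempfile(fileext = ".dta")
write.dta(d, f)

## convert.factors = FALSE skips converting value labels to factors,
## which is often the slow step on large files.
d2 <- read.dta(f, convert.factors = FALSE)
nrow(d2)
```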

If you have a 64-bit version of R it can handle quite large data sets, 
certainly millions of records.  On the other hand an ordinary PC might 
well start to slow down noticeably with a few tens of thousands of 
reasonably complex records.
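You can check the footprint of a data set directly with object.size(); a sketch at roughly the size the original poster mentioned (the columns are invented):

```r
## Rough in-memory footprint of a medium-sized data frame:
## an integer id, a numeric column, and a character grouping column.
n <- 150000
d <- data.frame(id = seq_len(n),
                x  = rnorm(n),
                g  = sample(letters, n, replace = TRUE))
print(object.size(d), units = "Mb")
```

A real analysis data set with many such columns, plus the working copies R makes during model fitting, multiplies this figure considerably.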

Often it is not necessary to store all the data in memory at once, and 
there are database interfaces to make this easier.
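A sketch of that approach, assuming the DBI and RSQLite packages are installed (the table name and query are illustrative): keep the full data in a database and pull only the rows a given analysis needs.

```r
## Keep the data in a database; fetch only the subset needed.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "obs", data.frame(id = 1:5000, x = rnorm(5000)))

## Only the selected rows ever enter R's memory.
small <- dbGetQuery(con, "SELECT id, x FROM obs WHERE id <= 100")
nrow(small)

dbDisconnect(con)
```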

R (and S before it) has generally assumed that increasing computer power 
will solve many problems more easily than programming would, and that 
assumption has generally been correct.

If you want Stata, you know where to find it (and it's a good choice for 
many problems).


 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle

