[R] Can R handle medium and large size data sets?
Philippe Grosjean
phgrosjean at sciviews.org
Tue Jan 24 16:03:22 CET 2006
Hello,
It is not true that R cannot handle matrices with hundreds of thousands of
observations... but:
- Importing the data (typically with read.table() and the like) "saturates"
memory much faster. Solution: use scan() to fill a preallocated matrix (see
the sketch after this list), or better, use a database.
- Data frames are very nice objects, but if you handle only numeric
data, do prefer matrices: they consume less memory (see the object.size()
comparison after this list). Also, avoid using row/column names for very
large matrices/data frames.
- Finally, of course, your mileage will vary greatly depending on the
calculations you perform on your data.
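
To illustrate the first point, here is a minimal sketch of scan() filling a
preallocated matrix. The file name "mydata.txt" and the dimensions are only
hypothetical, and I assume a whitespace-delimited file of purely numeric data:

nobs <- 150000   # assumed number of observations
nvar <- 20       # assumed number of variables
# scan() returns a plain numeric vector; reading the file row by row,
# we drop it straight into a matrix of the right dimensions
x <- matrix(scan("mydata.txt", what = double(0), n = nobs * nvar),
            nrow = nobs, ncol = nvar, byrow = TRUE)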
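
And a rough illustration of the second point (the exact figures depend on
your R version, so take them as indicative only):

a <- matrix(runif(150000 * 20), ncol = 20)   # 3 million doubles
object.size(a)                  # one contiguous numeric block
object.size(as.data.frame(a))   # same data split into 20 columns,
                                # plus row names and other overhead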
In general, the fairly widespread idea that R cannot handle large data sets
originates from using read.table(), data frames, and non-optimized code.
As an example, I can create a matrix of 150 000 observations (you don't
tell us how many variables, so I took 20 columns) filled with random
numbers, and calculate the mean of each variable very easily. Here it is:
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 168994  4.6     350000  9.4   350000  9.4
Vcells  62415  0.5     786432  6.0   290343  2.3
> system.time(a <- matrix(runif(150000 * 20), ncol = 20))
[1] 0.48 0.05 0.55 NA NA
> # Just a little bit more than half a second to create a table of
> # 3 million entries filled with random numbers (P IV, 3 GHz, Win XP)
> dim(a)
[1] 150000 20
> system.time(print(colMeans(a)))
[1] 0.4998859 0.5004760 0.4994155 0.5000711 0.5005029
[6] 0.4999672 0.5003233 0.5000419 0.4997827 0.5004858
[11] 0.5004905 0.4993428 0.4991187 0.5000143 0.5016212
[16] 0.4988943 0.4990586 0.5009718 0.4997235 0.5001220
[1] 0.03 0.00 0.03 NA NA
> # 30 milliseconds to calculate the means of all 20
> # variables over 150 000 observations
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells   169514  4.6     350000  9.4   350000  9.4
Vcells  3062785 23.4    9317558 71.1  9062793 69.2
> # Less than 30 Mb used (with a peak at 80 Mb)
Isn't it manageable?
Best,
Philippe Grosjean
Gueorgui Kolev wrote:
> Dear R experts,
>
> Is it true that R generally cannot handle medium-sized data sets (a
> couple of hundred thousand observations) and therefore large data
> sets (a couple of million observations)?
>
> I googled and I found lots of questions regarding this issue, but
> curiously there were no straightforward answers about what can be done
> to make R capable of handling such data.
>
> Is there something inherent in the structure of R that makes it
> impossible to work with, say, 100 000 observations or more? If so, is
> there any hope that R can be fixed in the future?
>
> My experience is rather limited: I tried to load a Stata data set of
> about 150 000 observations (which Stata handles instantly) using the
> "foreign" package. After half an hour R was still "thinking", so I
> stopped the attempt.
>
> Thank you in advance,
>
> Gueorgui Kolev
>
> Department of Economics and Business
> Universitat Pompeu Fabra
>