[R] Very Large Data Sets

Loren M. McCarter lorenmc at socrates.berkeley.edu
Thu Dec 23 18:33:47 CET 1999

On Wed, 22 Dec 1999, Tony Fagan wrote:

> List,
> Can R handle very large data sets (say, 100 million records) for data mining applications? My understanding is that Splus can not, but SAS can easily.
> Thanks,
> Tony Fagan

There have been a couple of posts about approaching this large-dataset
problem with the MySQL/Python/R combination. I will simply add some
information (a testimonial) about my experiences with this as a possible
solution. This combination has worked very, very well for me. As a former
SAS and Windows user, I decided to perform my dissertation data analyses
using FreeBSD, which does not run SAS. After about a year of tinkering
around with different ways to approach the problem of analyzing my
dissertation data (i.e., a moderately large set of ~1.5 million
observations of psychophysiological data), I have settled on this MySQL/Python/R
combination. In order to get to this stage, I looked into several other
solutions (e.g., Perl Data Language, PostgreSQL, Ox, APL, and Perl), but
this combination met my needs best. 

For my purposes, I find this solution to be better than any other 
(including SAS). MySQL is very, very fast, especially when using
an index. Just last night, I could not believe how quickly it created
an R dataset for me (only 30 seconds on a slow machine, a 66 MHz 486DX,
for a complex join of four tables, each table containing about
500K rows). For most data-analytic purposes, I go directly from (1)
subsetting the data in MySQL to (2) performing more sophisticated data
analyses in R. For some more complex queries, the Python
link is needed, but not for most (Python, of course, is useful for many
other reasons than linking from MySQL to R).
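
A minimal sketch of the subset-first step, offered as an illustration
rather than the exact setup described above. Python's standard-library
sqlite3 module stands in for MySQL so that the example runs without a
server, and the table names, column names, and the export_subset helper
are all hypothetical. It builds an index on the join key, joins two
tables, keeps one group, and writes the result as CSV text.

```python
import csv
import io
import sqlite3

# ASSUMPTION: sqlite3 stands in for MySQL; all table and column names
# below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subjects (subject_id INTEGER PRIMARY KEY, grp TEXT);
    CREATE TABLE readings (subject_id INTEGER, t INTEGER, heart_rate REAL);
    -- The index on the join key is what keeps the subsetting query fast.
    CREATE INDEX idx_readings_subject ON readings (subject_id);
""")
conn.executemany("INSERT INTO subjects VALUES (?, ?)",
                 [(1, "control"), (2, "treatment"), (3, "control")])
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)",
                 [(s, t, 60.0 + s + t) for s in (1, 2, 3) for t in range(4)])


def export_subset(conn, group, out):
    """Join the tables, keep one group, and write the rows to `out` as CSV."""
    cur = conn.execute(
        """SELECT r.subject_id AS subject_id,
                  r.t AS t,
                  r.heart_rate AS heart_rate
           FROM readings AS r
           JOIN subjects AS s ON s.subject_id = r.subject_id
           WHERE s.grp = ?
           ORDER BY r.subject_id, r.t""",
        (group,),
    )
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header row
    rows = cur.fetchall()
    writer.writerows(rows)
    return len(rows)


buf = io.StringIO()
n_rows = export_subset(conn, "control", buf)
```

Writing to a real file instead of the in-memory buffer would give a
CSV file that R can load with read.csv().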

For my dissertation data, there is no reason for me to analyze all 1.5 
million rows at once. Rather, I need to perform the same statistical procedures,
one or two subjects at a time (i.e., 2400 rows), over and over again. I
let the SQL backend do the large, number-crunching work and let R shine
for statistics, and it really does shine...
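
The one-subject-at-a-time pattern can be sketched as follows, again
under assumptions: sqlite3 stands in for MySQL, the readings table and
the per_subject_means helper are hypothetical, and a plain mean stands
in for whatever statistical procedure would actually be run in R on
each chunk.

```python
import sqlite3
import statistics

# ASSUMPTION: sqlite3 stands in for MySQL; the schema is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (subject_id INTEGER, t INTEGER, value REAL)")
conn.execute("CREATE INDEX idx_subject ON readings (subject_id)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(s, t, float(s * 10 + t)) for s in (1, 2, 3) for t in range(5)])


def per_subject_means(conn):
    """Fetch one subject's rows at a time and summarize each chunk."""
    subject_ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT subject_id FROM readings ORDER BY subject_id")]
    results = {}
    for sid in subject_ids:
        # Only this subject's rows are held in memory at any moment.
        values = [row[0] for row in conn.execute(
            "SELECT value FROM readings WHERE subject_id = ? ORDER BY t",
            (sid,))]
        # Placeholder for the real analysis that would be done in R.
        results[sid] = statistics.mean(values)
    return results


means = per_subject_means(conn)
```

The database does the selection and the analysis code only ever sees a
small chunk, so memory use depends on the size of one subject's data
rather than on the size of the whole table.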

Testimonially yours,



Loren Michael McCarter
Graduate Student-UC Berkeley
