[R] Large data sets and aggregation

Douglas Bates bates at stat.wisc.edu
Wed Feb 2 18:33:22 CET 2000


One of the many items on my "To Do" list is writing a short primer on
using relational databases with R.  I am currently working with some
data that consists of hundreds of thousands of records on thousands of
variables.  Needless to say, we don't load that directly into R.

We have written several scripts in Python to crack the original data
files (SAS packed data sets) and install the data into a relational
database.  The relational database system we use is MySQL
(www.mysql.com).  It provides powerful facilities for manipulating
data and performing operations like aggregation.  We have been
impressed with MySQL.  PostgreSQL is another possibility.  Timothy
Keitt has written a PostgreSQL interface package for R.

David James at Bell Labs has drafted an API for relational database
interfaces from S or R.  He also wrote an implementation of an
interface to MySQL for S, which Saikat DebRoy has modified for R.  I
understand that the RMySQL package should be uploaded to CRAN "real
soon".

This interface allows us to do all the large-scale data manipulation
in MySQL, then extract pieces for modeling within R.  There are many
advantages to this approach, including speed, data persistence,
simultaneous access to the data by several users, etc.  The main
disadvantage is the need to learn yet another language (SQL) to do the
manipulations.  I have found Paul Dubois's book "MySQL" to be a very
good way to learn MySQL.  The first chapter alone is worth the cost of
the book.
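To make the workflow concrete, here is a minimal sketch of the
"aggregate in the database, then model the result" pattern described
above.  Since we use Python scripts for data handling anyway, the
sketch is in Python; sqlite3 (from the standard library) stands in for
MySQL so the example is self-contained, and the table name and columns
are invented for illustration:

```python
import sqlite3

# An in-memory database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simulate a large multi-record-per-case table: 1000 cases,
# 50 records each -- 50,000 rows in total.
cur.execute("CREATE TABLE visits (case_id INTEGER, score REAL)")
cur.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(case_id, float(rec)) for case_id in range(1000) for rec in range(50)],
)

# The aggregation happens in SQL, not in the analysis environment:
# 50,000 records collapse to 1000 per-case summaries before anything
# is pulled into memory for modeling.
cur.execute(
    "SELECT case_id, COUNT(*) AS n, AVG(score) AS mean_score "
    "FROM visits GROUP BY case_id ORDER BY case_id"
)
summary = cur.fetchall()

print(len(summary))   # one row per case: 1000
print(summary[0])     # (0, 50, 24.5)
conn.close()
```

With a real MySQL server the only change is the connection step; the
GROUP BY query, and the idea of shipping only the reduced table to the
statistics environment, are the same.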

I should warn the list that I am famous for never getting to many of
the items on my "To Do" list, so you should not hold your breath
waiting for the primer.

Jim Lemon <bitwrit at ozemail.com.au> writes:

> I've noticed quite a few messages relating to large data sets bedeviling
> R users, and having just had to program my way through one that actually
> caused a "Bus error" when I tried to read it in, I'd like to ask two
> questions.
> 
> 1) Are there any facilities for aggregation of data in R?
> ( I admit that this will not do much for the large data set problem
> immediately)
> 
> 2) Is there any interest out there for a C-based roll-your-own
> aggregation program?
> 
> I've had to reduce over 150,000 records to just under 3000 for a
> multi-record per case data file, and I might be able to generalize the
> code enough to make it useful for others.

-- 
Douglas Bates                            bates at stat.wisc.edu
Statistics Department                    608/262-2598
University of Wisconsin - Madison        http://www.stat.wisc.edu/~bates/
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._


