[R] Very Large Data Sets

Douglas Bates bates at stat.wisc.edu
Thu Dec 23 15:02:53 CET 1999

kmself at ix.netcom.com writes:

> When dealing with large datasets outside of SAS, my suggestion would be
> to look to tools such as Perl and MySQL to handle the procedural and
> relational processing of data, using R as an analytic tool.  Most simple
> statistics (subsetting, aggregation, drilldown) can be accommodated
> through these sorts of tools.   Think of the relationship to R as the
> division as between the DATA step and SAS/STAT or SAS/GRAPH.

> I would be interested to know of any data cube tools which are freely
> available or available as free software.

I am considering this type of approach for an application that will
involve very large data sets.  I will probably use the python
scripting language rather than perl but the general approach is as you describe.

We currently have some code in R packages to read Stata data files and
(in the "foreign" package of the src/contrib/Devel section on CRAN) to
read SAS XPORT format data libraries.  These packages can help to move
data from one format to another they don't help with dealing with
massive numbers of records in R's memory-based model.

When faced with a large data set I first want to determine the
representation of the data and some basic summaries.  After that I
might want to work with a subset of the rows and/or columns when doing
some modeling and only use the entire data set to refine or confirm
the model.

My idea is to take standard data formats for data tables (SAS XPORT
format, SPSS sav files, ...), encapsulate them as python classes, and
provide methods that would summarize the columns and perhaps emulate
Martin Maechler's excellent "str" function from R.  For example, I
would want to know if every value of a numeric variable happened to be
an integer and always in the range from 1 up to 10, or something like
that.  This would indicate to me that it was probably a coding of a
factor and not a numeric variable.  The summary methods should only
require calculations that can be done a row at a time.  Thus
calculating minima and maxima is reasonable but getting medians and
quartiles is not.  The classes would not import all the data into
python - they would simply keep around enough information to read the
data a row at a time on demand.

Each of these classes would include a method to store the data as a
table in a relational database system.  There are python packages for
most common SQL databases, including the freely available PostgreSQL
and mySQL.  The inverse transformation, SQL table to proprietary data
format, would also be provided.

To work on a subset of the data within R we could try to enhance the
functions that read data from foreign formats to allow selection of
rows or columns.  However, as you suggest, that job is probably best
handled using SQL and functions within R that extract tables or views
from the SQL database.  I would note that Timothy Keitt has just
contributed an R package to interface with PostgreSQL.

Trying to write this type of code teaches you interesting things.  As
far as I can tell, you cannot discover the number of rows in a SAS
dataset from the header information.  The number of rows is not
recorded there and, because more than one dataset can be stored in a
library file, you cannot use the length of the file and the length of
the record to calculate the number of records (rows).  If you want to
allocate storage to hold the data in R the simplest thing to do is
is to read the file once to discover the structure then read
it again to import the data.  It shows you that the idea that your
data are sitting on a reel of tape over at the "computing center" is
wired into the SAS system at a very low level.
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch

More information about the R-help mailing list