[R] Very Large Data Sets

kmself@ix.netcom.com kmself at ix.netcom.com
Thu Dec 23 11:22:11 CET 1999

There are several components to this answer.  I'm not too well versed in
R, but I've run across the capacity question before.

R has a hard limit of 2 GB total memory, as I understand, and its data
model requires holding an entire data set in memory.  This makes it very
fast -- right up until a set no longer fits.  As I understand it, the
limit applies even on 64-bit systems.
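To make the in-memory constraint concrete, here's a back-of-envelope
calculation (Python, standing in for any calculator; the five-variable
layout is a made-up example) of what 100 million records would need if
every value must be resident at once:

```python
# Illustrative arithmetic only: memory needed to hold 100 million
# records entirely in RAM, as R's data model requires.
n_records = 100_000_000
n_vars = 5                 # hypothetical: five numeric variables per record
bytes_per_value = 8        # double-precision float

total_bytes = n_records * n_vars * bytes_per_value
total_gb = total_bytes / 2**30
print(f"{total_gb:.2f} GB required")   # ~3.73 GB, well past a 2 GB ceiling
```

Even a modest handful of numeric variables blows through the 2 GB limit
long before you reach 100 million records.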

SAS can "process" a practically infinite data stream, one observation at
a time (or more accurately, one read buffer at a time).  You can
approach this ideal using multiple-volume tape input on a number of OSs.
However, this ability is limited to simple and straightforward
processing -- DATA step and some very simple procedures.
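The streaming idea is simple enough to sketch.  This isn't SAS code --
it's a Python illustration of the same one-observation-at-a-time
pattern, with a hypothetical CSV layout -- but it shows why input size
is effectively unbounded when only one record is in memory at a time:

```python
# Sketch of SAS-style one-observation-at-a-time processing.  Only a
# single record is resident in memory at any moment, so the same code
# handles a 10-line file or a 100-million-line file.
def running_sum(lines):
    """Accumulate a sum and count over a stream without materializing it."""
    total = 0.0
    count = 0
    for line in lines:
        value = float(line.split(",")[0])   # first field of a CSV record
        total += value
        count += 1
    return total, count

# Usage: any iterable of lines works, including an open file object:
#     with open("huge.csv") as f:
#         total, count = running_sum(f)
demo = ["1.5,a", "2.5,b", "4.0,c"]
print(running_sum(demo))   # (8.0, 3)
```

Anything expressible as a running accumulation fits this mold; anything
requiring the whole set at once (sorting, most model fitting) does not,
which is exactly the DATA-step-versus-procedures split.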

Processing limits for various operations in SAS vary by OS, SAS version,
and operation.  For 32-bit OSs under releases 6.08 through 6.12, the
hard limits were 2 GB RAM, 2 GB disk, and 32,767 (2^15 - 1) for many
counted quantities.  For various reasons, the hard limits don't apply in
all cases, and workarounds were provided in several areas.

Under 64-bit OSs, these limits tend to be lifted, though occasionally
32-bit biases sneak through and bite you (there was one such bug in PROC
SQL).  Traditional limits, such as the number of levels (and significant
bytes in character variables) handled by PROC FREQ, have been greatly
increased in versions 7 and 8 of SAS.

Other limits are imposed more by the sheer size of problems.  Many SAS
statistical procedures are based on IML and are limited by memory and
set size.  Even when large memory sets are supported, complex problems
with many levels may still exceed the capacity of any system.  Moreover,
complex statistics may make little sense on such large datasets.

When dealing with large datasets outside of SAS, my suggestion would be
to look to tools such as Perl and MySQL to handle the procedural and
relational processing of data, using R as an analytic tool.  Most simple
statistics (subsetting, aggregation, drilldown) can be accommodated
through these sorts of tools.   Think of the division of labor with R
as analogous to that between the DATA step and SAS/STAT or SAS/GRAPH.
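To sketch that division of labor: let the relational engine do the
subsetting and aggregation, and hand only the small summary to R.  The
post suggests MySQL; here sqlite3 from Python's standard library stands
in for it, and the table and column names are hypothetical:

```python
import sqlite3

# The database does the heavy lifting; the analysis tool only ever sees
# one row per group, however many records the raw table holds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 20.0), ("west", 5.0)],
)

# Aggregation happens inside the database, not in the analysis tool.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(summary)   # [('east', 30.0), ('west', 5.0)]
```

The same GROUP BY works unchanged against a 100-million-row table in
MySQL; only the compact summary ever needs to cross into R's 2 GB world.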

I would be interested to know of any data cube tools which are freely
available or available as free software.

On Wed, Dec 22, 1999 at 10:38:30PM -0700, Tony Fagan wrote:
> List,
> Can R handle very large data sets (say, 100 million records) for data mining applications? My understanding is that Splus can not, but SAS can easily.
> Thanks,
> Tony Fagan

Karsten M. Self (kmself at ix.netcom.com)
    What part of "Gestalt" don't you understand?

SAS for Linux: http://www.netcom.com/~kmself/SAS/SAS4Linux.html
Mailing list:  "subscribe sas-linux" to mailto:majordomo at cranfield.ac.uk