[R] Reasons to Use R

Tue Apr 10 21:43:48 CEST 2007

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of 
> Bi-Info (http://members.home.nl/bi-info)
> Sent: Monday, April 09, 2007 4:23 PM
> To: Gabor Grothendieck
> Cc: Lorenzo Isella; r-help at stat.math.ethz.ch
> Subject: Re: [R] Reasons to Use R

[snip] 

> So what's the big deal about S using files instead of memory 
> like R. I don't get the point. Isn't there enough swap space 
> for S? (Who cares
> anyway: it works, isn't it?) Or are there any problems with S 
> and large datasets? I don't get it. You use them, Greg. So 
> you might discuss that issue.
> 
> Wilfred
> 
> 

This is my understanding of the issue (not anything official).

If you use up all the memory while in R, then the OS will start swapping
memory to disk.  The OS does not know which parts of memory correspond
to which objects, so it is entirely possible that a chunk swapped to
disk contains parts of several different data objects.  When you need
one of those objects again, everything needs to be swapped back in.
This is very inefficient.

S-PLUS occasionally runs into the same problem, but since it does some
of its own swapping to disk it can be more efficient by swapping single
data objects (data frames, etc.).  Also, since S-PLUS is already saving
everything to disk, it does not actually need to do a full swap.  It can
just see that a particular data frame has not been used for a while,
know that it is already saved on the disk, and unload it from memory
without having to write it to disk first.

The g.data package for R has some of this functionality of keeping data
on the disk until needed.
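
To make the idea of keeping objects on disk until needed concrete, here
is a minimal sketch in Python (not R, and not how g.data or S-PLUS is
actually implemented).  The class name DiskBackedStore and its methods
are hypothetical, made up for illustration.  It assumes objects are not
modified after they are saved, which is why unloading needs no write.

```python
import os
import pickle
import tempfile


class DiskBackedStore:
    """Hypothetical store: objects live on disk, loaded only on first use."""

    def __init__(self, directory):
        self.directory = directory
        self._cache = {}  # objects currently held in memory

    def _path(self, name):
        return os.path.join(self.directory, name + ".pkl")

    def save(self, name, obj):
        # Write the object to disk and do not keep it in memory.
        with open(self._path(name), "wb") as fh:
            pickle.dump(obj, fh)
        self._cache.pop(name, None)

    def get(self, name):
        # Read from disk only if the object is not already in memory.
        if name not in self._cache:
            with open(self._path(name), "rb") as fh:
                self._cache[name] = pickle.load(fh)
        return self._cache[name]

    def unload(self, name):
        # Assumption: the on-disk copy is still current, so dropping the
        # in-memory copy requires no write back to disk.
        self._cache.pop(name, None)

    def in_memory(self):
        # Names of the objects currently loaded.
        return sorted(self._cache)


store = DiskBackedStore(tempfile.mkdtemp())
store.save("heights", [1.6, 1.7, 1.8])  # on disk only
heights = store.get("heights")          # loaded on first use
store.unload("heights")                 # dropped without rewriting
```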

The better approach for large data sets is to only have some of the data
in memory at a time and to automatically read just the parts that you
need.  So for big datasets it is recommended to have the actual data
stored in a database and use one of the database connection packages to
only read in the subset that you need.  The SQLiteDF package for R is
working on automating this process for R.  Also, the bigdata module for
S-PLUS and the biglm package for R have ways of doing some of the
common analyses using chunks of data at a time.  This idea is not
new.  There was a program in the late 1970s and 80s called Rummage by
Del Scott (I guess technically it still exists, I have a copy on a 5.25"
floppy somewhere) that had you specify the model you wanted to fit
first, then specify the data file.  Rummage would then figure out
which sufficient statistics were needed and read the data in chunks,
compute the sufficient statistics on the fly, and not keep more than a
couple of lines of the data in memory at once.  Unfortunately it did not
have much of a user interface, so when memory was cheap and datasets
only medium sized it did not compete well.  I guess it was just a bit
too far ahead of its time.
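
As a rough illustration of the chunked approach described above, here is
a minimal sketch in Python (not R).  It is not how biglm, the bigdata
module, or Rummage actually work.  It assumes the data sit in a SQLite
table with numeric columns named x and y (hypothetical names), and the
function name fit_line_in_chunks is made up.  It fits y = a + b*x by
accumulating the sufficient statistics n, sum(x), sum(y), sum(x*x) and
sum(x*y) one chunk at a time, so no more than chunk_size rows are held
in memory at once.

```python
import sqlite3


def fit_line_in_chunks(conn, table, chunk_size=1000):
    """Fit y = a + b*x from `table`, reading chunk_size rows at a time."""
    n = sx = sy = sxx = sxy = 0.0
    # `table` is interpolated directly into the SQL, so it must be a
    # trusted name (an assumption of this sketch).
    cur = conn.execute(f"SELECT x, y FROM {table}")
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        # Update the running sufficient statistics; the chunk itself
        # is discarded on the next pass through the loop.
        for x, y in rows:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Usage with a small in-memory database standing in for a large one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(float(i), 2.0 + 3.0 * i) for i in range(100)],
)
intercept, slope = fit_line_in_chunks(conn, "measurements", chunk_size=10)
```

The same pattern extends to multiple regression by accumulating X'X and
X'y instead of the scalar sums.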

Hope this helps, 

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111