[R] Reasons to Use R

Douglas Bates bates at stat.wisc.edu
Wed Apr 11 16:03:41 CEST 2007


On 4/10/07, Wensui Liu <liuwensui at gmail.com> wrote:
> Greg,
> As far as I understand, SAS is probably more efficient at handling
> large data than S+/R. Do you have any idea why?

SAS originated at a time when large data sets were stored on magnetic
tape and the only reasonable way to process them was sequentially.
Thus most statistics procedures in SAS act as filters, processing one
record at a time and accumulating summary information.  In the past
SAS performed a least squares fit by accumulating the crossproduct of
[X:y] and then using the sweep operator to reduce that
matrix. For such an approach the number of observations does not
affect the amount of storage required.  Adding observations just
requires more time.
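
In R terms the filter approach looks something like the sketch below
(a rough illustration only; "big.csv", its column names, and the
chunk size are all made up):

    ## Accumulate X'X and X'y one chunk at a time, so storage does
    ## not grow with the number of observations; solve at the end.
    con <- file("big.csv", open = "r")
    readLines(con, n = 1)                  # skip the header row
    XtX <- matrix(0, 3, 3)                 # intercept, x1, x2
    Xty <- matrix(0, 3, 1)
    repeat {
        chunk <- tryCatch(
            read.csv(con, header = FALSE, nrows = 10000,
                     col.names = c("y", "x1", "x2")),
            error = function(e) NULL)      # end of file
        if (is.null(chunk) || nrow(chunk) == 0) break
        X <- cbind(1, chunk$x1, chunk$x2)
        XtX <- XtX + crossprod(X)          # X'X for this chunk
        Xty <- Xty + crossprod(X, chunk$y) # X'y for this chunk
    }
    close(con)
    solve(XtX, Xty)                        # normal-equations solution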

This works fine as long as the operations that you wish to perform
fit into this model, although there are numerical disadvantages to
the approach: forming the crossproduct roughly squares the condition
number of the problem, so precision is lost relative to a
decomposition-based fit.  (Try mentioning the sweep operator to an
expert in numerical linear algebra - you will get a blank stare.)
Making the desired operations fit into the filter model is the
primary reason for the awkwardness of many SAS analyses.
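
A small illustration of the numerical point (made-up data; the exact
figures will vary):

    ## X'X has roughly the squared condition number of X, so the
    ## normal-equations/sweep route can lose about twice as many
    ## digits as the QR-based fit that lm() uses.
    set.seed(1)
    x <- 1e4 + 1:25                  # regressor with a large offset
    y <- 3 + 2 * (x - 1e4) + rnorm(25)
    X <- cbind(1, x)
    kappa(X)                         # on the order of 1e7
    kappa(crossprod(X))              # roughly its square
    solve(crossprod(X), crossprod(X, y))  # normal equations
    coef(lm(y ~ x))                  # QR decomposition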

The emphasis in R is on flexibility and the use of good numerical
techniques - not on processing large data sets sequentially.  The
algorithms used in R for most least squares fits generate and analyze
the complete model matrix instead of summary quantities.  (The
algorithms in the biglm package are a compromise that works on
horizontal sections of the model matrix.)
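
For instance, a chunked fit with biglm runs along these lines (a
sketch, with simulated chunks standing in for sections of a large
data set):

    ## Fit on the first horizontal section, then fold in later
    ## sections with update(); only one chunk is in memory at a time.
    library(biglm)
    set.seed(1)
    chunk1 <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    chunk1$y <- 1 + 2 * chunk1$x1 - chunk1$x2 + rnorm(100)
    chunk2 <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    chunk2$y <- 1 + 2 * chunk2$x1 - chunk2$x2 + rnorm(100)
    fit <- biglm(y ~ x1 + x2, data = chunk1)
    fit <- update(fit, chunk2)       # fold in the next section
    coef(fit)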

If your only criterion for comparison is the ability to work with
very large data sets, performing operations that fit into the filter
model used by SAS, then SAS will be the better choice.  However, you
do lock yourself into a certain set of operations, and you are doing
it to save memory - a commodity whose price drops very rapidly.

As mentioned in other replies, for many years the majority of SAS
usage has been data manipulation rather than statistical analysis, so
the filter model has been modified in later versions.





> On 4/10/07, Greg Snow <Greg.Snow at intermountainmail.org> wrote:
> > > -----Original Message-----
> > > From: r-help-bounces at stat.math.ethz.ch
> > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
> > > Bi-Info (http://members.home.nl/bi-info)
> > > Sent: Monday, April 09, 2007 4:23 PM
> > > To: Gabor Grothendieck
> > > Cc: Lorenzo Isella; r-help at stat.math.ethz.ch
> > > Subject: Re: [R] Reasons to Use R
> >
> > [snip]
> >
> > > So what's the big deal about S using files instead of memory
> > > like R? I don't get the point. Isn't there enough swap space
> > > for S? (Who cares anyway: it works, doesn't it?) Or are there
> > > any problems with S and large datasets? I don't get it. You use
> > > them, Greg. So you might discuss that issue.
> > >
> > > Wilfred
> > >
> > >
> >
> > This is my understanding of the issue (not anything official).
> >
> > If you use up all the memory while in R, the OS will start swapping
> > memory to disk.  The OS does not know which parts of memory correspond
> > to which objects, though, so a chunk swapped to disk may well contain
> > parts of several different data objects; when you need one of those
> > objects again, everything has to be swapped back in.  This is very
> > inefficient.
> >
> > S-PLUS occasionally runs into the same problem, but since it does some
> > of its own swapping to disk it can be more efficient, swapping single
> > data objects (data frames, etc.).  Also, since S-PLUS is already saving
> > everything to disk, it does not actually need to do a full swap: it can
> > simply notice that a particular data frame has not been used for a
> > while, know that it is already saved on disk, and unload it from
> > memory without writing it out first.
> >
> > The g.data package for R has some of this functionality, keeping data
> > on disk until it is needed.
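> >
> > If I remember the interface right, usage runs along these lines
> > (an untested sketch; "bigstore" is a made-up directory name):
> >
> >     library(g.data)
> >     g.data.attach("bigstore")   # attach (or create) the data directory
> >     assign("big.x", data.frame(v = rnorm(1e5)), pos = 2)
> >     g.data.save()               # write the attached objects to disk
> >     ## on a later g.data.attach(), big.x is loaded from disk only
> >     ## when it is actually used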
> >
> > The better approach for large data sets is to have only some of the
> > data in memory at a time and to automatically read just the parts
> > that you need.  So for big datasets it is recommended to store the
> > actual data in a database and use one of the database connection
> > packages to read in only the subset that you need (see the sketch
> > below).  The SQLiteDF package is working on automating this process
> > for R.  The bigdata module for S-PLUS and the biglm package for R
> > also have ways of doing some of the common analyses using chunks of
> > data at a time.
> >
> > This idea is not new.  There was a program in the late 1970s and 80s
> > called Rummage, by Del Scott (I guess technically it still exists, I
> > have a copy on a 5.25" floppy somewhere), in which you specified the
> > model you wanted to fit first and then specified the data file.
> > Rummage would figure out which sufficient statistics were needed,
> > read the data in chunks, compute the sufficient statistics on the
> > fly, and keep no more than a couple of lines of the data in memory
> > at once.  Unfortunately it did not have much of a user interface, so
> > once memory became cheap and datasets were only medium sized it did
> > not compete well; I guess it was just a bit ahead of its time.
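> >
> > For example, pulling just the needed subset through RSQLite looks
> > roughly like this (a sketch only; "mydata.db" and the "sales" table
> > are made up):
> >
> >     library(RSQLite)
> >     con <- dbConnect(SQLite(), dbname = "mydata.db")
> >     ## read only the rows this analysis needs, not the whole table
> >     west <- dbGetQuery(con,
> >         "SELECT revenue, units FROM sales WHERE region = 'west'")
> >     coef(lm(revenue ~ units, data = west))
> >     dbDisconnect(con)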
> >
> > Hope this helps,
> >
> >
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at intermountainmail.org
> > (801) 408-8111
> >
> >
>
>
> --
> WenSui Liu
> A lousy statistician who happens to know a little programming
> (http://spaces.msn.com/statcompute/blog)
>
>


