[R] Reasons to Use R [Broadcast]
Liaw, Andy
andy_liaw at merck.com
Wed Apr 11 19:00:29 CEST 2007
From: Douglas Bates
>
> On 4/10/07, Wensui Liu <liuwensui at gmail.com> wrote:
> > Greg,
> > As far as I understand, SAS is probably more efficient than S+/R at
> > handling large data. Do you have any idea why?
>
> SAS originated at a time when large data sets were stored on
> magnetic tape and the only reasonable way to process them was
> sequentially.
> Thus most statistics procedures in SAS act as filters,
> processing one record at a time and accumulating summary
> information. In the past SAS performed a least squares fit
> by accumulating the crossproduct of [X:y] and then using the
> sweep operator to reduce that matrix. For such an
> approach the number of observations does not affect the
> amount of storage required. Adding observations just
> requires more time.
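
A minimal sketch of the filter model described above, written in Python purely for illustration. It is not SAS's actual implementation; the function name and the toy data are hypothetical. Each record is read once, folded into the cross-product of [X:y], and discarded, so storage depends only on the number of columns:

```python
def accumulate_crossproduct(records, p):
    """Accumulate the (p+1) x (p+1) cross-product of the augmented rows [x : y].

    `records` is any iterable of (x, y) pairs, where x has length p.
    Only the running matrix is kept in memory, never the rows themselves.
    """
    k = p + 1
    cp = [[0.0] * k for _ in range(k)]
    n = 0
    for x, y in records:
        row = list(x) + [y]
        for i in range(k):
            for j in range(k):
                cp[i][j] += row[i] * row[j]
        n += 1
    return cp, n


def toy_records():
    # Hypothetical data: intercept column plus one predictor t, with y = 1 + 2*t.
    for t in range(1000):
        yield (1.0, float(t)), 1.0 + 2.0 * t


cp, n = accumulate_crossproduct(toy_records(), p=2)
print(n, len(cp), len(cp[0]))  # the matrix stays 3 x 3 however many records are read
```

Feeding in ten times as many records would take ten times as long but would still leave a 3 x 3 matrix.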
>
> This works fine (although there are numerical disadvantages
> to this approach - try mentioning the sweep operator to an
> expert in numerical linear algebra - you get a blank stare)
For those who stared blankly at the above: The sweep operator is
just a fancier version of the good old Gaussian elimination...
Andy
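
To make the connection to Gaussian elimination concrete, here is a small Python sketch of one common form of the sweep operator. Sign conventions differ between references, and this is not taken from SAS; the function name and the toy cross-product matrix are illustrative assumptions. Each sweep is one pivot step of elimination that also updates the pivot row and column in place of building a separate inverse:

```python
def sweep(a, k):
    """Return a copy of the square matrix `a` swept on pivot k."""
    d = a[k][k]
    if d == 0:
        raise ZeroDivisionError("zero pivot")
    n = len(a)
    b = [row[:] for row in a]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                b[i][j] = 1.0 / d                      # pivot element
            elif i == k:
                b[i][j] = a[k][j] / d                  # pivot row
            elif j == k:
                b[i][j] = -a[i][k] / d                 # pivot column
            else:
                b[i][j] = a[i][j] - a[i][k] * a[k][j] / d  # elimination step
    return b


# Cross-product of [X:y] for the toy data t = 0,1,2,3 and y = 1,3,5,7,
# with X = [1, t]. The blocks are X'X, X'y, y'X and y'y.
cp = [[4.0, 6.0, 16.0],
      [6.0, 14.0, 34.0],
      [16.0, 34.0, 84.0]]

# Sweeping on the X'X pivots leaves the coefficients in the last column
# and the residual sum of squares in the bottom-right corner.
swept = sweep(sweep(cp, 0), 1)
beta = [swept[0][2], swept[1][2]]
rss = swept[2][2]
print(beta, rss)
```

With this convention the upper-left block of the swept matrix is the inverse of X'X.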
> as long as the operations that you wish to perform fit into
> this model. Making the desired operations fit into the model
> is the primary reason for the awkwardness in many SAS analyses.
>
> The emphasis in R is on flexibility and the use of good
> numerical techniques - not on processing large data sets
> sequentially. The algorithms used in R for most least
> squares fits generate and analyze the complete model matrix
> instead of summary quantities. (The algorithms in the biglm
> package are a compromise that work on horizontal sections of
> the model matrix.)
>
> If your only criterion for comparison is the ability to work
> with very large data sets performing operations that can fit
> into the filter model used by SAS then SAS will be a better
> choice. However you do lock yourself into a certain set of
> operations and you are doing it to save memory, which is a
> commodity that decreases in price very rapidly.
>
> As mentioned in other replies, for many years the majority of
> SAS use has been for data manipulation rather than for
> statistical analysis, so the filter model has been modified in
> later versions.
>
> > On 4/10/07, Greg Snow <Greg.Snow at intermountainmail.org> wrote:
> > > > -----Original Message-----
> > > > From: r-help-bounces at stat.math.ethz.ch
> > > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Bi-Info
> > > > (http://members.home.nl/bi-info)
> > > > Sent: Monday, April 09, 2007 4:23 PM
> > > > To: Gabor Grothendieck
> > > > Cc: Lorenzo Isella; r-help at stat.math.ethz.ch
> > > > Subject: Re: [R] Reasons to Use R
> > >
> > > [snip]
> > >
> > > > So what's the big deal about S using files instead of memory
> > > > like R. I don't get the point. Isn't there enough swap space
> > > > for S? (Who cares anyway: it works, isn't it?) Or are there
> > > > any problems with S and large datasets? I don't get it. You
> > > > use them, Greg. So you might discuss that issue.
> > > >
> > > > Wilfred
> > > >
> > > >
> > >
> > > This is my understanding of the issue (not anything official).
> > >
> > > If you use up all the memory while in R, then the OS will start
> > > swapping memory to disk, but the OS does not know what parts of
> > > memory correspond to which objects, so it is entirely possible
> > > that the chunk swapped to disk contains parts of different data
> > > objects, so when you need one of those objects again, everything
> > > needs to be swapped back in. This is very inefficient.
> > >
> > > S-PLUS occasionally runs into the same problem, but since it does
> > > some of its own swapping to disk it can be more efficient by
> > > swapping single data objects (data frames, etc.). Also, since
> > > S-PLUS is already saving everything to disk, it does not actually
> > > need to do a full swap, it can just look and see that a particular
> > > data frame has not been used for a while, know that it is already
> > > saved on the disk, and unload it from memory without having to
> > > write it to disk first.
> > >
> > > The g.data package for R has some of this functionality of
> > > keeping data on the disk until needed.
> > >
> > > The better approach for large data sets is to only have some of
> > > the data in memory at a time and to automatically read just the
> > > parts that you need. So for big datasets it is recommended to
> > > have the actual data stored in a database and use one of the
> > > database connection packages to only read in the subset that you
> > > need. The SQLiteDF package for R is working on automating this
> > > process for R. Also, the bigdata module for S-PLUS and the biglm
> > > package for R have ways of doing some of the common analyses
> > > using chunks of data at a time. This idea is not new. There was
> > > a program in the late 1970s and 80s called Rummage by Del Scott
> > > (I guess technically it still exists, I have a copy on a 5.25"
> > > floppy somewhere) that used the approach of specifying the model
> > > you wanted to fit first, then specifying the data file. Rummage
> > > would then figure out which sufficient statistics were needed,
> > > read the data in chunks, compute the sufficient statistics on
> > > the fly, and not keep more than a couple of lines of the data in
> > > memory at once. Unfortunately it did not have much of a user
> > > interface, so when memory was cheap and datasets only medium
> > > sized it did not compete well. I guess it was just a bit too
> > > ahead of its time.
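
The chunked, sufficient-statistics approach described above can be sketched as follows. This is an illustration in Python using the standard sqlite3 module, not the method of Rummage, SQLiteDF or biglm; the table name `obs`, its columns and the helper name are hypothetical. Only one chunk of rows is held in memory at a time, and a simple linear regression is recovered from five running sums:

```python
import sqlite3


def chunked_sufficient_stats(conn, chunk_size=100):
    """Read (x, y) rows in chunks and accumulate n, sum x, sum y, sum x^2, sum x*y."""
    cur = conn.execute("SELECT x, y FROM obs")
    n = 0
    sx = sy = sxx = sxy = 0.0
    while True:
        rows = cur.fetchmany(chunk_size)  # at most chunk_size rows in memory
        if not rows:
            break
        for x, y in rows:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    return n, sx, sy, sxx, sxy


# Hypothetical toy table standing in for a large database: y = 3 + 0.5*x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(float(t), 3.0 + 0.5 * t) for t in range(1000)],
)

n, sx, sy, sxx, sxy = chunked_sufficient_stats(conn)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(n, slope, intercept)
```

Note that forming the fit from raw sums like this is the numerically weaker route mentioned earlier in the thread; it is used here only because it makes the memory behaviour easy to see.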
> > >
> > > Hope this helps,
> > >
> > >
> > >
> > > --
> > > Gregory (Greg) L. Snow Ph.D.
> > > Statistical Data Center
> > > Intermountain Healthcare
> > > greg.snow at intermountainmail.org
> > > (801) 408-8111
> > >
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
> >
> > --
> > WenSui Liu
> > A lousy statistician who happens to know a little programming
> > (http://spaces.msn.com/statcompute/blog)
> >
>