[R] Reasons to Use R

Liaw, Andy andy_liaw at merck.com
Wed Apr 11 19:00:29 CEST 2007


From: Douglas Bates
> 
> On 4/10/07, Wensui Liu <liuwensui at gmail.com> wrote:
> > Greg,
> > As far as I understand, SAS is probably more efficient than S+/R
> > at handling large data.  Do you have any idea why?
> 
> SAS originated at a time when large data sets were stored on 
> magnetic tape and the only reasonable way to process them was 
> sequentially.
> Thus most statistics procedures in SAS act as filters, 
> processing one record at a time and accumulating summary 
> information.  In the past SAS performed a least squares fit 
> by accumulating the crossproduct of [X:y] and then using the
> sweep operator to reduce that matrix.  For such an
> approach the number of observations does not affect the 
> amount of storage required.  Adding observations just 
> requires more time.
> 
> This works fine (although there are numerical disadvantages 
> to this approach - try mentioning the sweep operator to an 
> expert in numerical linear algebra - you get a blank stare) 

For those who stared blankly at the above:  the sweep operator is
just a fancier version of the good old Gaussian elimination...
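
To make the filter idea concrete, here is a toy sketch of chunked
least squares in R: accumulate X'X and X'y one block of rows at a
time, then solve the normal equations (which is all that sweeping the
crossproduct of [X:y] amounts to).  The file and column names are
invented, and forming X'X this way inherits the numerical weakness
Doug alludes to:

    con <- file("big.csv", "r")
    hdr <- strsplit(readLines(con, n = 1), ",")[[1]]
    XtX <- 0
    Xty <- 0
    repeat {
        txt <- readLines(con, n = 10000)   # one block of rows
        if (length(txt) == 0) break
        chunk <- read.csv(text = txt, header = FALSE, col.names = hdr)
        X <- cbind(1, chunk$x1, chunk$x2)  # intercept + two predictors
        XtX <- XtX + crossprod(X)          # stays p x p, never n x p
        Xty <- Xty + crossprod(X, chunk$y)
    }
    close(con)
    beta <- solve(XtX, Xty)               # the normal equations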

Andy

> as long as the operations that you wish to perform fit into 
> this model.  Making the desired operations fit into the model 
> is the primary reason for the awkwardness in many SAS analyses.
> 
> The emphasis in R is on flexibility and the use of good 
> numerical techniques - not on processing large data sets 
> sequentially.  The algorithms used in R for most least 
> squares fits generate and analyze the complete model matrix 
> instead of summary quantities.  (The algorithms in the biglm 
> package are a compromise that work on horizontal sections of 
> the model matrix.)
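
To see the chunked compromise in action, here is a minimal biglm
sketch on simulated data; real use would feed successive chunks from
a file or database rather than slices of one data frame:

    library(biglm)
    set.seed(1)
    d <- data.frame(x = rnorm(1e5))
    d$y <- 1 + 2 * d$x + rnorm(1e5)
    fit <- biglm(y ~ x, data = d[1:50000, ])  # first chunk
    fit <- update(fit, d[50001:100000, ])     # fold in the next chunk
    coef(fit)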
> 
> If your only criterion for comparison is the ability to work
> with very large data sets, performing operations that fit into
> the filter model used by SAS, then SAS will be a better choice.
> However, you do lock yourself into a certain set of operations,
> and you are doing it to save memory, a commodity whose price
> drops very rapidly.
> 
> As mentioned in other replies, for many years the majority of
> SAS use has been for data manipulation rather than statistical
> analysis, so the filter model has been modified in later
> versions.
> 
> > On 4/10/07, Greg Snow <Greg.Snow at intermountainmail.org> wrote:
> > > > -----Original Message-----
> > > > From: r-help-bounces at stat.math.ethz.ch 
> > > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Bi-Info 
> > > > (http://members.home.nl/bi-info)
> > > > Sent: Monday, April 09, 2007 4:23 PM
> > > > To: Gabor Grothendieck
> > > > Cc: Lorenzo Isella; r-help at stat.math.ethz.ch
> > > > Subject: Re: [R] Reasons to Use R
> > >
> > > [snip]
> > >
> > > > So what's the big deal about S using files instead of memory
> > > > like R?  I don't get the point.  Isn't there enough swap space
> > > > for S?  (Who cares anyway: it works, doesn't it?)  Or are there
> > > > any problems with S and large datasets?  I don't get it.  You
> > > > use them, Greg, so you might discuss that issue.
> > > >
> > > > Wilfred
> > > >
> > > >
> > >
> > > This is my understanding of the issue (not anything official).
> > >
> > > If you use up all the memory while in R, then the OS will start
> > > swapping memory to disk, but the OS does not know what parts of
> > > memory correspond to which objects, so it is entirely possible
> > > that the chunk swapped to disk contains parts of different data
> > > objects; when you need one of those objects again, everything
> > > needs to be swapped back in.  This is very inefficient.
> > >
> > > S-PLUS occasionally runs into the same problem, but since it does
> > > some of its own swapping to disk it can be more efficient by
> > > swapping single data objects (data frames, etc.).  Also, since
> > > S-PLUS is already saving everything to disk, it does not actually
> > > need to do a full swap; it can just see that a particular data
> > > frame has not been used for a while, know that it is already
> > > saved on the disk, and unload it from memory without having to
> > > write it to disk first.
> > >
> > > The g.data package for R has some of this functionality,
> > > keeping data on the disk until it is needed.
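
The keep-it-on-disk trick is easy to sketch in base R with a promise
that loads an object only when it is first touched.  This is just the
underlying mechanism, not g.data's actual interface, and the object
and file names are made up:

    big.obj <- matrix(rnorm(1e6), 1000)  # stand-in for a large object
    saveRDS(big.obj, "big.rds")          # park a copy on disk
    rm(big.obj)                          # free the memory
    delayedAssign("big.obj", readRDS("big.rds"))
    dim(big.obj)                         # first use forces the load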
> > >
> > > The better approach for large data sets is to have only some of
> > > the data in memory at a time and to automatically read just the
> > > parts that you need.  So for big datasets it is recommended to
> > > store the actual data in a database and use one of the database
> > > connection packages to read in only the subset that you need.
> > > The SQLiteDF package is working on automating this process for
> > > R.  The bigdata module for S-PLUS and the biglm package for R
> > > also have ways of doing some of the common analyses using chunks
> > > of the data at a time.
> > >
> > > This idea is not new.  There was a program in the late 1970s and
> > > 80s called Rummage, by Del Scott, that took the approach of
> > > specifying the model you wanted to fit first, then specifying the
> > > data file.  (I guess technically it still exists; I have a copy
> > > on a 5.25" floppy somewhere.)  Rummage would then figure out
> > > which sufficient statistics were needed, read the data in chunks,
> > > compute the sufficient statistics on the fly, and never keep more
> > > than a couple of lines of the data in memory at once.
> > > Unfortunately it did not have much of a user interface, so once
> > > memory was cheap and datasets only medium-sized it did not
> > > compete well; I guess it was just a bit too far ahead of its
> > > time.
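
To make Greg's database suggestion concrete, here is a minimal sketch
using the DBI-based RSQLite package (the file, table, and column names
are invented; the point is that only the needed subset ever reaches R):

    library(RSQLite)
    con <- dbConnect(SQLite(), "big.db")
    ## pull just the rows and columns the analysis needs
    sub <- dbGetQuery(con,
        "SELECT x, y FROM measurements WHERE year = 2006")
    fit <- lm(y ~ x, data = sub)
    dbDisconnect(con)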
> > >
> > > Hope this helps,
> > >
> > >
> > >
> > > --
> > > Gregory (Greg) L. Snow Ph.D.
> > > Statistical Data Center
> > > Intermountain Healthcare
> > > greg.snow at intermountainmail.org
> > > (801) 408-8111
> > >
> >
> >
> > --
> > WenSui Liu
> > A lousy statistician who happens to know a little programming
> > (http://spaces.msn.com/statcompute/blog)
> >
> 
