[R] Reasons to Use R

Wed Apr 11 00:27:50 CEST 2007

I think SAS has the database part built into it.  I have heard secondhand
of new statisticians going to work for a company and asking whether it
has SAS, and getting the reply "Yes, we use SAS for our database; does it
do statistics also?"  I have also heard that SAS is no longer considered
an acronym: the company likes having it be just a name and does not want
the fact that one of the S's used to stand for "statistical" to scare
away companies that use it as a database.

Maybe someone more up on SAS can confirm or deny this.

Also, one issue to always look at is central control versus ease of
extensibility.  If you have a program that is completely under your
control and does one set of things, then extending it to a new model
(big data) is fairly straightforward.  R is at the opposite end of the
spectrum, with many contributors and many techniques.  Extending some
basic pieces to be very efficient with big data could be done easily,
but would break many other pieces.  Getting all the different packages
to conform to a single standard in a short amount of time would be near
impossible.

With R's flexibility, there are probably some problems that can be done
more quickly with proper use of biglm than with SAS, and I expect that
with some more work and maturity the SQLiteDF package may start to rival
SAS on certain problems as well.  While SAS is a useful program and great
at certain things, there are some techniques that I would not even attempt
in SAS that are fairly straightforward in R.  (I remember seeing some
SAS code to do a bootstrap that included a data step to read in and
extract information from a SAS output file.  <<SHUDDER>>  SAS/ODS has
improved this, but I would much rather bootstrap in R/S-PLUS than
anything else.)
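
To make the chunked idea concrete (the one biglm and Rummage rely on,
described in the quoted message below), here is a minimal sketch, not
anything official.  It is written in Python only to keep it free of
package dependencies; the function names, the chunk size, and the demo
data are all made up for illustration.  It accumulates the raw sums
needed for a one-predictor least-squares fit and never holds more than
one chunk of rows in memory.  As far as I know, biglm uses an incremental
QR update, which is numerically safer than the raw sums used here, so
treat this as the idea and not as what biglm does.

```python
import csv
import io
from itertools import islice


def chunked_simple_regression(rows, chunk_size=1000):
    """Fit y = a + b*x while holding at most chunk_size rows in memory.

    rows is any iterator of (x, y) pairs, e.g. one reading a file lazily.
    Only five running sums (the sufficient statistics) are kept.
    """
    n = 0
    sx = sy = sxx = sxy = 0.0
    rows = iter(rows)
    while True:
        # Pull the next chunk; earlier chunks are already discarded.
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    if n < 2 or denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


def read_pairs(handle):
    """Yield (x, y) floats lazily from a CSV handle with header x,y."""
    for rec in csv.DictReader(handle):
        yield float(rec["x"]), float(rec["y"])


# Hypothetical data standing in for a large file: y = 2 + 3x exactly.
demo = io.StringIO("x,y\n" + "\n".join(f"{i},{2 + 3 * i}" for i in range(10)))
intercept, slope = chunked_simple_regression(read_pairs(demo), chunk_size=4)
```

The same pattern works with a real file handle or a database cursor in
place of the in-memory demo, since only the running sums survive from one
chunk to the next.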

Remember, everything is better than everything else given the right
comparison.

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

> -----Original Message-----
> From: Wensui Liu [mailto:liuwensui at gmail.com] 
> Sent: Tuesday, April 10, 2007 3:26 PM
> To: Greg Snow
> Cc: Bi-Info (http://members.home.nl/bi-info); Gabor 
> Grothendieck; Lorenzo Isella; r-help at stat.math.ethz.ch
> Subject: Re: [R] Reasons to Use R
> 
> Greg,
> As far as I understand, SAS is probably more efficient at handling
> large data than S+/R. Do you have any idea why?
> 
> On 4/10/07, Greg Snow <Greg.Snow at intermountainmail.org> wrote:
> > > -----Original Message-----
> > > From: r-help-bounces at stat.math.ethz.ch 
> > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Bi-Info 
> > > (http://members.home.nl/bi-info)
> > > Sent: Monday, April 09, 2007 4:23 PM
> > > To: Gabor Grothendieck
> > > Cc: Lorenzo Isella; r-help at stat.math.ethz.ch
> > > Subject: Re: [R] Reasons to Use R
> >
> > [snip]
> >
> > > So what's the big deal about S using files instead of memory like
> > > R. I don't get the point. Isn't there enough swap space for S? (Who
> > > cares anyway: it works, isn't it?) Or are there any problems with S
> > > and large datasets? I don't get it. You use them, Greg. So you might
> > > discuss that issue.
> > >
> > > Wilfred
> > >
> > >
> >
> > This is my understanding of the issue (not anything official).
> >
> > If you use up all the memory while in R, then the OS will start
> > swapping memory to disk, but the OS does not know what parts of memory
> > correspond to which objects, so it is entirely possible that the chunk
> > swapped to disk contains parts of different data objects, so when you
> > need one of those objects again, everything needs to be swapped back
> > in.  This is very inefficient.
> >
> > S-PLUS occasionally runs into the same problem, but since it does some
> > of its own swapping to disk it can be more efficient by swapping
> > single data objects (data frames, etc.).  Also, since S-PLUS is
> > already saving everything to disk, it does not actually need to do a
> > full swap, it can just look and see that a particular data frame has
> > not been used for a while, know that it is already saved on the disk,
> > and unload it from memory without having to write it to disk first.
> >
> > The g.data package for R has some of this functionality of keeping 
> > data on the disk until needed.
> >
> > The better approach for large data sets is to only have some of the
> > data in memory at a time and to automatically read just the parts that
> > you need.  So for big datasets it is recommended to have the actual
> > data stored in a database and use one of the database connection
> > packages to only read in the subset that you need.  The SQLiteDF
> > package for R is working on automating this process for R.  Also, the
> > bigdata module for S-PLUS and the biglm package for R have ways of
> > doing some of the common analyses using chunks of data at a time.
> > This idea is not new.  There was a program in the late 1970s and 80s
> > called Rummage by Del Scott (I guess technically it still exists, I
> > have a copy on a 5.25" floppy somewhere) that used the approach of
> > specifying the model you wanted to fit first, then specifying the data
> > file.  Rummage would then figure out which sufficient statistics were
> > needed and read the data in chunks, compute the sufficient statistics
> > on the fly, and not keep more than a couple of lines of the data in
> > memory at once.  Unfortunately it did not have much of a user
> > interface, so when memory was cheap and datasets only medium sized it
> > did not compete well.  I guess it was just a bit too far ahead of its
> > time.
> >
> > Hope this helps,
> >
> >
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at intermountainmail.org
> > (801) 408-8111
> >
> >
> 
> 
> --
> WenSui Liu
> A lousy statistician who happens to know a little programming
> (http://spaces.msn.com/statcompute/blog)
>