[R-sig-finance] R vs. S-PLUS vs. SAS

Dirk Eddelbuettel edd at debian.org
Sat Dec 4 17:31:53 CET 2004


On Sat, Dec 04, 2004 at 07:15:40AM -0500, Andrew Piskorski wrote:
> On Fri, Dec 03, 2004 at 06:37:15PM +0000, Patrick Burns wrote:
> 
> > There may be some differences between SAS procedures, but
> > at least generally SAS does not require the whole data to be in
> > RAM.  Regression will take the data row by row and do an update
> > for the answer.
> 
> Someone might want to ask Joe Conway about his experience and thoughts
> integrating R as a procedural language inside PostgreSQL, to create
> PL/R:
> 
>   http://www.joeconway.com/plr/
>   http://gborg.postgresql.org/project/plr/projdisplay.php
> 
> (Hm, for good measure, I have Cc'd him on this email.)  Obviously, an

Very good point, but you didn't CC Joe. Done now. Hi Joe :)

> RDBMS like PostgreSQL is expert at dealing with data that doesn't fit
> into RAM.  I've no idea whether PL/R does anything special to take
> advantage of that, or how feasible it would be to do so.
> 
> Does anyone here know much about what makes R dependent on all data
> being in RAM, or of links to same?  Is it just some centralized
> low-level bits, or do broad swaths of code and algorithms all depend
> on the in-RAM assumption?

Discount my $0.02 severely, as I don't really know what I am rambling
about, but here goes anyway since talk is cheap:

S implementations are from a 'workstation' design era: data objects live in
RAM.  As Pat mentioned in this thread, they used to be far less efficient
than they are now; R has made huge leaps.

Our friendly list members from Insightful may want to complement me here with
factual data :)

> How do SAS and other such systems avoid that?  Do they do this better

SAS reflects its mainframe-age design: it passes (efficiently) over huge
amounts of data that could never have been held in memory anyway.
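
To make that concrete, here is a rough sketch in R of the row-by-row
(well, chunk-by-chunk) updating idea Pat describes: accumulate the
normal equations X'X and X'y over a file that is never fully in memory,
then solve at the end.  The file name and column layout are invented for
illustration; this shows the updating trick, not how SAS actually
implements it.

  ## One pass over a big CSV, holding only one chunk in RAM at a time
  con <- file("bigdata.csv", open = "r")    # hypothetical file
  invisible(readLines(con, n = 1))          # skip the header row
  xtx <- matrix(0, 3, 3)                    # running X'X (intercept, x1, x2)
  xty <- matrix(0, 3, 1)                    # running X'y
  repeat {
      lines <- readLines(con, n = 10000)    # next chunk of raw rows
      if (length(lines) == 0) break
      tc    <- textConnection(lines)
      chunk <- read.csv(tc, header = FALSE, col.names = c("y", "x1", "x2"))
      close(tc)
      X   <- cbind(1, chunk$x1, chunk$x2)
      xtx <- xtx + crossprod(X)             # add this chunk's X'X
      xty <- xty + crossprod(X, chunk$y)    # add this chunk's X'y
  }
  close(con)
  beta <- solve(xtx, xty)   # same coefficients as lm(y ~ x1 + x2), up to rounding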

The interactive/exploratory/graphical nature of S versus the
batch/non-interactive/non-graphical nature of SAS follows relatively
cleanly from that basic design premise.

> or much more more transparently than what an R user would do now
> manually?  Where by "manually", I mean, query some fits-in-RAM amount
> data out of an RDBMS (or other such on-disk store), analyze it, delete
> the data to free up RAM, and repeat.
>
> Could one say, tie a light-weight high-performance RDBMS library, like
> SQLite, into R, and have R use it profitably to scale nicely on data
> that does not fit in RAM?  In what way, if any, would this offer a
> substantial advantage over current manual R-plus-RDBMS practice?

Fei Chen, a doctoral student of Brian Ripley, gave a truly impressive
presentation at DSC 2003 about out-of-memory work with R. I bugged Brian
repeatedly about writeups on this, but apparently there are none. Fei is now
a professional data miner working on truly gigantic data sets ...

It can be done, but it requires surgery on the engine.  For someone really
committed, it may be worth digging up Fei Chen's dissertation.  Might even
be a market niche for Insightful to explore. 
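
For what it's worth, the "manual" R-plus-RDBMS loop Andrew describes
looks roughly like the sketch below, written against the DBI/RSQLite
interface.  The database file, table and column names are made up, and
this says nothing about what Joe's PL/R does; the statistical update is
the same chunk-wise normal-equation trick as above.

  library(DBI)                              # plus the RSQLite package
  con <- dbConnect(RSQLite::SQLite(), "trades.db")       # hypothetical database
  res <- dbSendQuery(con, "SELECT px, qty FROM trades")  # hypothetical table
  xtx <- matrix(0, 2, 2)                    # running X'X (intercept, qty)
  xty <- matrix(0, 2, 1)                    # running X'y
  while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 50000)      # only this chunk is ever in RAM
      if (nrow(chunk) == 0) break
      X   <- cbind(1, chunk$qty)
      xtx <- xtx + crossprod(X)             # fold the chunk into X'X
      xty <- xty + crossprod(X, chunk$px)   # and into X'y
      rm(chunk)                             # free the chunk before the next fetch
  }
  dbClearResult(res)
  dbDisconnect(con)
  beta <- solve(xtx, xty)                   # fit over the whole table, never all in RAM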

Dirk


-- 
If you don't go with R now, you will someday.
  -- David Kane on r-sig-finance, 30 Nov 2004


