[R] MySQL + R as a Replacement for SAS Proc SQL + Various Stat Procs

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Jan 19 08:12:08 CET 2008

On Fri, 18 Jan 2008, JWilliamson at lecg.com wrote:

> For cost reasons, I'd like to replace SAS on my PC under Win XP Pro.
> Nearly all my work involves medium-size datasets (100k-10M) records which I
> cleanup, relate, fliter and get into shape for analysis using SAS/SQL
> followed by standard statistical procedures, e.g. regression using SAS proc
> reg.
> It seems to me that this type of analysis could be done in MySQL followed
> by R, but I'd like some advice about the best way to pass datasets from
> MySQL to R.  I understand there are various connectivity packages
> avaialble, RODBC and RMySQL, but I'd appreciate some advice about where to
> dig in first.
> I'd like to avoid additional syntax in my SQL code -- so it seems better
> not to coat SQL queries in R wrappers and pass them to MySQL -- probably
> better for my way of working to finish all the SQL work and pass a clean
> table ready to analyze to R.

The way these work (using RODBC is an example) is

- If necessary, send the data to MySQL via sqlSave().

- Use sqlQuery() to send SQL statements verbatim to the RDBMS (here MySQL)

- Retrieve a table via sqlFetch().

- Do the analysis on the fetched table.

If the table is very large, you can fetch in junks and use the facilities 
in the 'biglm' package to do a regression a block of data at a time.
However, I am not sure of the value of using more than 10,000 cases in a 
regression, as well before that non-sampling errors will dominate the 
error distribution: e.g. the systematic error from model misfit may be 
larger than the nominal standard errors.

I can see why experienced SAS users like to use it for data cleanup, but 
it seem generally true that the user is a more important variable than the 
tool: people work best with the tools they understand best (and personal 
preference comes into it).

> One of the great advantages for me using SAS is that I can beat the data
> into shape using proc SQL and then call proc STAT_OF_THE_DAY all in the
> same batch file with no plumbing, data conversion or additional
> machination.

The way I am sketching above is using R as the scripting language. It's a 
pretty powerful one, certainly powerful enuough to do the text processing 
needed to prepare SQL queries.

> But if I could do this in MySQL + R, well in a few years I'd have the down
> payment for an Audi TT instead of having given it to SAS Institute.
> If someone found a practical way to make this work -- pls let me know.  And
> thanks in advance.
> Jack Williamson
> 2049 Century Park East, Suite 2300
> Los Angeles CA 90067
> 323-683-5004
> jwilliamson at lecg.com
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

More information about the R-help mailing list