[R] Applying an R script to data within MySQL? How to?

Ted Byers r.ted.byers at gmail.com
Wed Oct 8 23:58:07 CEST 2008


I am trying something I haven't attempted before and the available
documentation doesn't quite answer my questions (at least in a way I can
understand).  My usual course of action would be to extract my data from my
DB, do whatever manipulation is necessary, either manually or using a C++
program, and then import the data into R.  Now I need to try to do it all
within R+RMySQL+MySQL.

I just managed to connect to MySQL and retrieve data using RMySQL as
follows:


> library(DBI)
> library(RMySQL)
> MySQL(max.con = 16, fetch.default.rec = 500, force.reload = F)
<MySQLDriver:(3800)> 
> m <- dbDriver("MySQL")
> con <- dbConnect(m, user="rejbyers", password = "jesakos",
> host="localhost", dbname = "merchants2")
> rs <- dbSendQuery(con, "select * from merchants")
> df <- fetch(rs, n = 150)
> df

And of course, that last statement is followed by the entire contents of
"merchants"

Now, I have a script like the following:
refdata18 = read.csv("K:\\MerchantData\\RiskModel\\ndays18.csv",
na.strings="")
x1 = refdata18[,1]
library(MASS)
ex1 = fitdistr(x1,"exponential")
str(ex1)


Now, the contents of ndaysXX.csv represent records where one of the date
values is in week XX of the current year. We don't yet have data spanning
multiple years, and will have to modify the SQL that gets the data
accordingly.  At present, my SQL statement groups records by WEEK of the
year, and then I manually separate weeks in a CSV file outside the DB.

Suppose I make a query like: SELECT ndays FROM xxxxxxx GROUP BY WEEK(tdate);

There is no a priori of knowing just how many weeks of data there are.

My reason for asking is I see information in the documentation about
"dbApply(RMySQL)" which says: "Applies R functions to groups of remote DBMS
rows without bringing an entire result set all at once. The result set is
expected to be sorted by the grouping field."  There is an example, but the
example doesn't make much sense (the query used, for example, does not
contain a GROUP BY clause).

I can easily set up a table that could be used to manage the output I need
(primarily the rate value estimated for each week, and the SD of the
estimate), but at present I am at a loss as to how to proceed to set this
up.  

Can some kind soul out there give me rather pedantic instructions on how to
use RMySQL to apply, in my case fitdistr, independantly to each group of
values returned by my simplistic SQL query above, and insert the rate and sd
into another table?

I know I can handle all this using a perl script to create a suite of
temporary files, and process them one by one, but I have also been advised
to try to use R instead of Perl for this kind of task.

A slightly related question is this: "Assuming I can get this all working
from within R, how would I make it a scheduled task on the one hand, or, on
the other hand, run it on demand from an event on a web page (which at
present is made using a combination of PHP, Apache's httpd server and MySQL,
if that matters)?"  Of course, if I can make such an R script (or even store
it as a function) there should be no memory from one instance to another,
because the same analysis would have to be done on different users' data.

Thanks

Ted
-- 
View this message in context: http://www.nabble.com/Applying-an-R-script-to-data-within-MySQL---How-to--tp19888407p19888407.html
Sent from the R help mailing list archive at Nabble.com.



More information about the R-help mailing list