[R] database table merging tips with R
cminear at securecomputing.com
Thu Sep 11 21:42:47 CEST 2008
While the subquery with a temporary table is probably the better
option, you could just manually generate the subquery and pass it in
with the query. As an example, if you have user_ids 1000-1005,
instead of having "... where user_id in (select user_id from
r_user_id)", you would have "... where user_id in
(1000,1001,1002,1003,1004,1005)". With 2000 user_ids, you'd just have
R generate the string, the whole query for that matter, and pass it
in. Sure, it's a few thousand characters going into the database as a
query, but it sure beats pulling down a few million characters that
you're going to just throw away.
I have no idea how the performance of the above method would compare
to using a temporary table (on the database side, of course), but I
have a hard time believing that it wouldn't be within a factor of
two. But I'm just making the suggestion.
Avram Aelony writes:
> Dear R list,
> What is the best way to efficiently marry an R dataset with a very large (Oracle) database table?
> The goal is to only return Oracle table rows that match IDs present in the R dataset.
> I have an R data frame with 2000 user IDs analogous to: r = data.frame(userid=round(runif(2000)*100000,0))
> ...and I need to pull data from an Oracle table only for these 2000 IDs. The Oracle table is quite large. Additionally, the sql query may need to join to other tables to bring in ancillary fields.
> I currently connect to Oracle via odbc:
> connection <- odbcConnect("****", uid="****", pwd="****")
> d = sqlQuery(connection, "select userid, x, y, z from largetable where timestamp > sysdate -7")
> ...allowing me to pull data from the database table into the R object "d" and then use the R merge function. The problem however is that if "d" is too large it may fail due to memory limitations or be inefficient. I would like to push the merge portion to the database and it would be very convenient if it were possible to request that the query look to the R object for the ID's to which it should restrict the output.
> Is there a way to do this?
> Something like the following fictional code:
> d = sqlQuery(connection, "select t.userid, x, y, z from largetable t where r$userid=t.userid")
> Would sqldf (http://code.google.com/p/sqldf/) help me out here? If so, how? This would be convenient and help me avoid needing to create a temporary table to store the R data, join via sql, then return the data back to R.
> I am using R version 2.7.2 (2008-08-25) / i386-pc-mingw32 .
> Thanks for your comments, ideas, recommendations.
> R-help at r-project.org mailing list
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help