[R-sig-DB] dbSendPreparedQuery in RSQLite

Sean Davis @d@v|@2 @end|ng |rom m@||@n|h@gov
Fri Apr 18 01:19:02 CEST 2008


On Thu, Apr 17, 2008 at 2:04 PM, James Bullard <bullard using berkeley.edu> wrote:
> Thank you Sean, this worked great. Now the cost of going to the database
> pales in comparison to the split that I need to do in the end to properly
> divide up the returned data.frame into a list where each member is a region.
> It is unfortunate because the data.frame returned to R is sorted in region
> order.
>
>  res <-
>
> >
> > > chr location count      region
> > > 1       10              1               1
> > > 1       290            10       1
> > > 2       245            3                2
> > > 3       10              2               3
> > >
> >
>
>
>
>  So then in order to return the object in the appropriate list fashion I do:
>
>  split(res, res$region)
>
>  This splitting dominates the cost of the entire function. Unfortunately,
> the splitting doesn't use the fact that it is sorted according to the
> factor. This I think I will handle by going to the database and getting the
> number of rows in a region and then using that vector to partition up the
> data (unless of course anyone has any smarter ideas)

What is the final information that you want to have?  Do you actually
need the detail of the counts per region, or are you doing some
aggregate of the data in the end?  In other words, what do you do with
the data after the split?

>  Thanks again, I really appreciate the detailed example it helped
> tremendously in sorting out the indexing.

Great.  That is what email lists are all about--we all learn from them!

>
>  On Apr 16, 2008, at 12:21 PM, Sean Davis wrote:
>
> > On Wed, Apr 16, 2008 at 1:35 PM, James Bullard <bullard using berkeley.edu>
> wrote:
> >
> > > Okay, well I thought that what I was doing was not easily done with one
> > > query but since I am wrong more often than not I'll ask and see if
> anyone
> > > can think of a way to combine it into one query.
> > >
> > > I have a set of regions:
> > >
> > > regions <- cbind(chr = c(1,2,3), start = c(1,230,4950), end = c(293,
> 320,
> > > 19200))
> > >
> > > In the db I have the following table:
> > >
> > > CREATE TABLE count_table (chr, location, count);
> > >
> > > chr location count
> > > 1       10              1
> > > 1       290             10
> > > 2       245             3
> > > 3       10              2
> > >
> > > The query that I am using is:
> > >
> > > apply(regions, 1, function(region) {
> > >       paste("SELECT * FROM count_table WHERE chr =", region[1], "AND
> > > location >=", region[2],  "AND location <=", region[3], ";")
> > > })
> > >
> > > Which gives me back the following (which is exactly what I want):
> > > [[1]]
> > >  chr location count__1
> > > 1   1       10        1
> > > 2   1      290       10
> > >
> > > [[2]]
> > >  chr location count__1
> > > 1   2      245        3
> > >
> > > [[3]]
> > > NULL data frame with 0 rows
> > >
> > >
> > > So I don't see any obvious way to pass in multiple ranges, but maybe I
> > > missing something with using the IN operator, thanks in advance, sorry
> for
> > > the long email.
> > >
> >
> >
> > > counts <-
> data.frame(chr=c(1,1,2,3),location=c(10,290,245,10),count=c(1,10,3,2))
> > > dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
> > > dbGetQuery(con,'select * from count_table')
> > >
> >  chr location count__1
> > 1   1       10        1
> > 2   1      290       10
> > 3   2      245        3
> > 4   3       10        2
> >
> >
> > > regions <- data.frame(chr = c(1,2,3), start = c(1,230,4950), end =
> c(293, 320, 19200))
> > > dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE)
> > > dbGetQuery(con,'select * from regions')
> > >
> >  chr start end__1
> > 1   1     1    293
> > 2   2   230    320
> > 3   3  4950  19200
> >
> >
> >
> > > dbGetQuery(con,'select count_table.* from count_table join regions on
> regions.chr=count_table.chr and count_table.location between regions.start
> and regions.end__1')
> > >
> >  chr location count__1
> > 1   1       10        1
> > 2   1      290       10
> > 3   2      245        3
> >
> > So, the principle of a temp table works.  Let's try some benchmarking.
> > Create a random count table of length 100k rows, write it to a
> > database and create some indexes to speed querying.
> >
> >
> >
> > > counts <-
> data.frame(chr=sample(1:22,100000,replace=TRUE),location=sample(1:100000000,100000,replace=TRUE),count=sample(1:10,100000,replace=TRUE))
> > > counts[1:5,]
> > >
> >  chr location count
> > 1  14 91737974     5
> > 2   5 61059218     2
> > 3   2 23944824     4
> > 4   3 82907389     9
> > 5  22 94658940     1
> >
> > > dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
> > >
> > [1] TRUE
> >
> > > dbGetQuery(con,'create index ct_chr on count_table(chr)')
> > >
> > NULL
> >
> > > dbGetQuery(con,'create index ct_location on count_table(location)')
> > >
> > NULL
> >
> > So, to try a benchmark, I wrote a little function that creates random
> > regions (n of them, see function below).  In the tmp table approach,
> > the data.frame is written to a table in the database and a join is
> > used.  In the second, the apply() approach is used.  The function
> > below returns the system.time for each approach.  In my quick
> > benchmarking, the temp table approach is about 5 times faster
> > (including writing the temp table) with a count_table of length 100k
> > rows.  I would expect that it might be a very much larger difference
> > as the size of the count_table increases, as the indexes on
> > count_table must be scanned for each of the apply queries and only
> > once for the temp table/join approach.
> >
> > benchRegions <- function(n) {
> >  # n is the number of regions.
> >  # returns a list with 2 sets of system times, one for
> >  #  the temp table approach, and one for the apply approach
> >  res <- list()
> >  starts=sample(1:100000000,n,replace=TRUE)
> >  ends=starts+sample(1:1000,n,replace=TRUE)
> >  regions <- data.frame(chr=sample(1:22,n,replace=TRUE),
> >                        start=starts,
> >                        end=ends)
> >  res[['tmptab']] <-
> >
> system.time({dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE);
> > dbGetQuery(con,'select count_table.* from count_table join regions on
> > regions.chr=count_table.chr and count_table.location between
> > regions.start and regions.end__1')})
> >  res[['apply']] <- system.time(apply(regions, 1, function(region)
> > {dbGetQuery(con,paste("SELECT * FROM count_table WHERE chr =",
> > region[1], "AND location >=", region[2],  "AND location <=",
> > region[3], ";"))
> > }))
> >  return(res)
> > }
> >
> > Hope that helps.
> >
> > Sean
> >
> >
> >
> > > On Apr 16, 2008, at 8:47 AM, Sean Davis wrote:
> > >
> > >
> > > > On Tue, Apr 15, 2008 at 11:45 PM, James Bullard <bullard using berkeley.edu>
> > > >
> > > wrote:
> > >
> > > >
> > > >
> > > > > Yes, in this example that would certainly work. However, I have a
> large
> > > > > number of queries in a tight loop and it would be nice to avoid the
> > > > >
> > > >
> > > overhead
> > >
> > > >
> > > > > of the query parsing and compiling on each call through the loop.
> > > > >
> > > > >
> > > >
> > > > A couple of typical tricks for avoidance of loops is to either:
> > > >
> > > > 1) Use an "in" query
> > > > 2) Use a temporary table (CREATE TEMPORARY TABLE .... or
> > > > dbWriteTable()) and then do a join
> > > >
> > > > If you have hundreds or thousands of queries, it is quite possible
> > > > that #2 could be much faster.  Benchmarking would be useful here, but
> > > > it is definitely worth a try if you are too slow with looping.
> > > >
> > > > Sean
> > > >
> > > >
> > > >
> > > > > thanks again, jim
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Apr 15, 2008, at 7:21 PM, Sean Davis wrote:
> > > > >
> > > > >
> > > > >
> > > > > > On Tue, Apr 15, 2008 at 8:31 PM, James Bullard
> <bullard using berkeley.edu>
> > > > > >
> > > > > >
> > > > > wrote:
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > How do we send prepared queries in RSQLite. We are interested in
> > > > > > > something along the lines of
> > > > > > >
> > > > > > > ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr =
> ?",
> > > > > > > data.frame("integer"))
> > > > > > > for( i in 1:2)
> > > > > > > rs = dbGetPreparedQuery(ps, i)
> > > > > > >
> > > > > > > In C it would look something along the lines of
> > > > > > >
> http://www.mail-archive.com/sqlite-users@sqlite.org/msg31854.html
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > It looks like this is not supported for select statements, though
> I
> > > > > > could easily be wrong.  In this particular case, you could rewrite
> > > > > > your select as:
> > > > > >
> > > > > > "SELECT * from table1 where chr in (1,2)"
> > > > > >
> > > > > > Sean
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> >
>
>




More information about the R-sig-DB mailing list