[R-sig-DB] dbSendPreparedQuery in RSQLite

Sean Davis @d@v|@2 @end|ng |rom m@||@n|h@gov
Wed Apr 16 21:21:43 CEST 2008


On Wed, Apr 16, 2008 at 1:35 PM, James Bullard <bullard using berkeley.edu> wrote:
> Okay, well I thought that what I was doing was not easily done with one
> query but since I am wrong more often than not I'll ask and see if anyone
> can think of a way to combine it into one query.
>
>  I have a set of regions:
>
>  regions <- cbind(chr = c(1,2,3), start = c(1,230,4950), end = c(293, 320,
> 19200))
>
>  In the db I have the following table:
>
>  CREATE TABLE count_table (chr, location, count);
>
>  chr location count
>  1       10              1
>  1       290             10
>  2       245             3
>  3       10              2
>
>  The query that I am using is:
>
>  apply(regions, 1, function(region) {
>         paste("SELECT * FROM count_table WHERE chr =", region[1], "AND
> location >=", region[2],  "AND location <=", region[3], ";")
>  })
>
>  Which gives me back the following (which is exactly what I want):
>  [[1]]
>   chr location count__1
>  1   1       10        1
>  2   1      290       10
>
>  [[2]]
>   chr location count__1
>  1   2      245        3
>
>  [[3]]
>  NULL data frame with 0 rows
>
>
>  So I don't see any obvious way to pass in multiple ranges, but maybe I
> missing something with using the IN operator, thanks in advance, sorry for
> the long email.

> counts <- data.frame(chr=c(1,1,2,3),location=c(10,290,245,10),count=c(1,10,3,2))
> dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
> dbGetQuery(con,'select * from count_table')
  chr location count__1
1   1       10        1
2   1      290       10
3   2      245        3
4   3       10        2

> regions <- data.frame(chr = c(1,2,3), start = c(1,230,4950), end = c(293, 320, 19200))
> dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE)
> dbGetQuery(con,'select * from regions')
  chr start end__1
1   1     1    293
2   2   230    320
3   3  4950  19200


> dbGetQuery(con,'select count_table.* from count_table join regions on regions.chr=count_table.chr and count_table.location between regions.start and regions.end__1')
  chr location count__1
1   1       10        1
2   1      290       10
3   2      245        3

So, the principle of a temp table works.  Let's try some benchmarking.
 Create a random count table of length 100k rows, write it to a
database and create some indexes to speed querying.


> counts <- data.frame(chr=sample(1:22,100000,replace=TRUE),location=sample(1:100000000,100000,replace=TRUE),count=sample(1:10,100000,replace=TRUE))
> counts[1:5,]
  chr location count
1  14 91737974     5
2   5 61059218     2
3   2 23944824     4
4   3 82907389     9
5  22 94658940     1
> dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
[1] TRUE
> dbGetQuery(con,'create index ct_chr on count_table(chr)')
NULL
> dbGetQuery(con,'create index ct_location on count_table(location)')
NULL

So, to try a benchmark, I wrote a little function that creates random
regions (n of them, see function below).  In the tmp table approach,
the data.frame is written to a table in the database and a join is
used.  In the second, the apply() approach is used.  The function
below returns the system.time for each approach.  In my quick
benchmarking, the temp table approach is about 5 times faster
(including writing the temp table) with a count_table of length 100k
rows.  I would expect that it might be a very much larger difference
as the size of the count_table increases, as the indexes on
count_table must be scanned for each of the apply queries and only
once for the temp table/join approach.

benchRegions <- function(n) {
   # n is the number of regions.
   # returns a list with 2 sets of system times, one for
   #  the temp table approach, and one for the apply approach
   res <- list()
   starts=sample(1:100000000,n,replace=TRUE)
   ends=starts+sample(1:1000,n,replace=TRUE)
   regions <- data.frame(chr=sample(1:22,n,replace=TRUE),
                         start=starts,
                         end=ends)
   res[['tmptab']] <-
system.time({dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE);
dbGetQuery(con,'select count_table.* from count_table join regions on
regions.chr=count_table.chr and count_table.location between
regions.start and regions.end__1')})
   res[['apply']] <- system.time(apply(regions, 1, function(region)
{dbGetQuery(con,paste("SELECT * FROM count_table WHERE chr =",
region[1], "AND location >=", region[2],  "AND location <=",
region[3], ";"))
}))
   return(res)
}

Hope that helps.

Sean


>  On Apr 16, 2008, at 8:47 AM, Sean Davis wrote:
>
> > On Tue, Apr 15, 2008 at 11:45 PM, James Bullard <bullard using berkeley.edu>
> wrote:
> >
> > > Yes, in this example that would certainly work. However, I have a large
> > > number of queries in a tight loop and it would be nice to avoid the
> overhead
> > > of the query parsing and compiling on each call through the loop.
> > >
> >
> > A couple of typical tricks for avoidance of loops is to either:
> >
> > 1) Use an "in" query
> > 2) Use a temporary table (CREATE TEMPORARY TABLE .... or
> > dbWriteTable()) and then do a join
> >
> > If you have hundreds or thousands of queries, it is quite possible
> > that #2 could be much faster.  Benchmarking would be useful here, but
> > it is definitely worth a try if you are too slow with looping.
> >
> > Sean
> >
> >
> > > thanks again, jim
> > >
> > >
> > >
> > >
> > > On Apr 15, 2008, at 7:21 PM, Sean Davis wrote:
> > >
> > >
> > > > On Tue, Apr 15, 2008 at 8:31 PM, James Bullard <bullard using berkeley.edu>
> > > >
> > > wrote:
> > >
> > > >
> > > >
> > > > > How do we send prepared queries in RSQLite. We are interested in
> > > > > something along the lines of
> > > > >
> > > > > ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr = ?",
> > > > > data.frame("integer"))
> > > > > for( i in 1:2)
> > > > > rs = dbGetPreparedQuery(ps, i)
> > > > >
> > > > > In C it would look something along the lines of
> > > > > http://www.mail-archive.com/sqlite-users@sqlite.org/msg31854.html
> > > > >
> > > > >
> > > >
> > > > It looks like this is not supported for select statements, though I
> > > > could easily be wrong.  In this particular case, you could rewrite
> > > > your select as:
> > > >
> > > > "SELECT * from table1 where chr in (1,2)"
> > > >
> > > > Sean
> > > >
> > > >
> > >
> > >
> > >
> >
>
>




More information about the R-sig-DB mailing list