[R-sig-DB] dbSendPreparedQuery in RSQLite
Sean Davis
sdavis2 sending from mail.nih.gov
Wed Apr 16 21:21:43 CEST 2008
On Wed, Apr 16, 2008 at 1:35 PM, James Bullard <bullard using berkeley.edu> wrote:
> Okay, well I thought that what I was doing was not easily done with one
> query but since I am wrong more often than not I'll ask and see if anyone
> can think of a way to combine it into one query.
>
> I have a set of regions:
>
> regions <- cbind(chr = c(1,2,3), start = c(1,230,4950), end = c(293, 320,
> 19200))
>
> In the db I have the following table:
>
> CREATE TABLE count_table (chr, location, count);
>
> chr location count
> 1 10 1
> 1 290 10
> 2 245 3
> 3 10 2
>
> The query that I am using is:
>
> apply(regions, 1, function(region) {
> paste("SELECT * FROM count_table WHERE chr =", region[1], "AND
> location >=", region[2], "AND location <=", region[3], ";")
> })
>
> Which gives me back the following (which is exactly what I want):
> [[1]]
> chr location count__1
> 1 1 10 1
> 2 1 290 10
>
> [[2]]
> chr location count__1
> 1 2 245 3
>
> [[3]]
> NULL data frame with 0 rows
>
>
> So I don't see any obvious way to pass in multiple ranges, but maybe I'm
> missing something with the IN operator. Thanks in advance, and sorry for
> the long email.
> counts <- data.frame(chr=c(1,1,2,3),location=c(10,290,245,10),count=c(1,10,3,2))
> dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
> dbGetQuery(con,'select * from count_table')
chr location count__1
1 1 10 1
2 1 290 10
3 2 245 3
4 3 10 2
> regions <- data.frame(chr = c(1,2,3), start = c(1,230,4950), end = c(293, 320, 19200))
> dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE)
> dbGetQuery(con,'select * from regions')
chr start end__1
1 1 1 293
2 2 230 320
3 3 4950 19200
> dbGetQuery(con,'select count_table.* from count_table join regions on regions.chr=count_table.chr and count_table.location between regions.start and regions.end__1')
chr location count__1
1 1 10 1
2 1 290 10
3 2 245 3
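As an aside on the original question: newer RSQLite/DBI versions do support binding parameters to a compiled select statement, which avoids re-parsing the SQL on each call. A minimal sketch (not from the original session; the connection and table here are set up fresh in memory):

```r
## Parameterized (prepared) query sketch using the DBI interface.
## Assumes a recent RSQLite; uses an in-memory database for illustration.
library(DBI)
library(RSQLite)

con2 <- dbConnect(SQLite(), ":memory:")
counts <- data.frame(chr = c(1, 1, 2, 3),
                     location = c(10, 290, 245, 10),
                     count = c(1, 10, 3, 2))
dbWriteTable(con2, "count_table", counts, row.names = FALSE)

## Compile the statement once, then bind one set of parameters per
## region; dbBind() re-executes without re-parsing the SQL.
rs <- dbSendQuery(con2, "SELECT * FROM count_table
                         WHERE chr = ? AND location BETWEEN ? AND ?")
dbBind(rs, list(1, 1, 293))
res <- dbFetch(rs)      # rows for chr 1 with location in [1, 293]
print(res)
dbClearResult(rs)
dbDisconnect(con2)
```

For a tight loop over many regions, each iteration would call dbBind()/dbFetch() on the same result object rather than building a new query string.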
So, the temp-table principle works. Let's try some benchmarking:
create a random count table of 100k rows, write it to the database,
and create some indexes to speed querying.
> counts <- data.frame(chr=sample(1:22,100000,replace=TRUE),location=sample(1:100000000,100000,replace=TRUE),count=sample(1:10,100000,replace=TRUE))
> counts[1:5,]
chr location count
1 14 91737974 5
2 5 61059218 2
3 2 23944824 4
4 3 82907389 9
5 22 94658940 1
> dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
[1] TRUE
> dbGetQuery(con,'create index ct_chr on count_table(chr)')
NULL
> dbGetQuery(con,'create index ct_location on count_table(location)')
NULL
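A side note not in the original session: SQLite will typically use only one index per table in a query, so for this chr-equality-plus-location-range pattern a single composite index may serve better than the two single-column indexes above. A hedged sketch (the index name is made up):

```r
## Hypothetical alternative: one composite index covering the equality
## column (chr) first and the range column (location) second.
dbGetQuery(con, 'create index ct_chr_location on count_table(chr, location)')
```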
To benchmark, I wrote a little function (below) that creates n random
regions. In the first approach, the regions data.frame is written to a
temp table in the database and a join is used; in the second, the
apply() approach is used. The function returns the system.time for
each approach. In my quick benchmarking, the temp-table approach is
about 5 times faster (including writing the temp table) with a
count_table of 100k rows. I would expect an even larger difference as
count_table grows, since the indexes on count_table must be scanned
once per query in the apply approach but only once in the temp
table/join approach.
benchRegions <- function(n) {
    ## n is the number of regions.
    ## Returns a list with two sets of system times: one for the
    ## temp-table approach, one for the apply approach.
    res <- list()
    starts <- sample(1:100000000, n, replace = TRUE)
    ends <- starts + sample(1:1000, n, replace = TRUE)
    regions <- data.frame(chr = sample(1:22, n, replace = TRUE),
                          start = starts,
                          end = ends)
    res[['tmptab']] <- system.time({
        dbWriteTable(con, 'regions', regions, row.names = FALSE,
                     overwrite = TRUE)
        dbGetQuery(con, 'select count_table.* from count_table
                         join regions on regions.chr = count_table.chr
                         and count_table.location between
                             regions.start and regions.end__1')
    })
    res[['apply']] <- system.time(apply(regions, 1, function(region) {
        dbGetQuery(con, paste("SELECT * FROM count_table WHERE chr =",
                              region[1], "AND location >=", region[2],
                              "AND location <=", region[3], ";"))
    }))
    return(res)
}
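A hypothetical call (timings will of course vary by machine; `elapsed` is the wall-clock component of each `system.time` result):

```r
## Compare elapsed time for 1000 random regions against the
## count_table and connection 'con' set up above.
times <- benchRegions(1000)
times[['tmptab']]['elapsed']
times[['apply']]['elapsed']
```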
Hope that helps.
Sean
> On Apr 16, 2008, at 8:47 AM, Sean Davis wrote:
>
> > On Tue, Apr 15, 2008 at 11:45 PM, James Bullard <bullard using berkeley.edu>
> wrote:
> >
> > > Yes, in this example that would certainly work. However, I have a large
> > > number of queries in a tight loop and it would be nice to avoid the
> overhead
> > > of the query parsing and compiling on each call through the loop.
> > >
> >
> > A couple of typical tricks for avoiding loops:
> >
> > 1) Use an "in" query
> > 2) Use a temporary table (CREATE TEMPORARY TABLE .... or
> > dbWriteTable()) and then do a join
> >
> > If you have hundreds or thousands of queries, it is quite possible
> > that #2 will be much faster. Benchmarking would be useful here, but
> > it is definitely worth a try if looping is too slow.
> >
> > Sean
> >
> >
> > > thanks again, jim
> > >
> > >
> > >
> > >
> > > On Apr 15, 2008, at 7:21 PM, Sean Davis wrote:
> > >
> > >
> > > > On Tue, Apr 15, 2008 at 8:31 PM, James Bullard <bullard using berkeley.edu>
> > > >
> > > wrote:
> > >
> > > >
> > > >
> > > > > How do we send prepared queries in RSQLite. We are interested in
> > > > > something along the lines of
> > > > >
> > > > > ps = dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr = ?",
> > > > > data.frame("integer"))
> > > > > for( i in 1:2)
> > > > > rs = dbGetPreparedQuery(ps, i)
> > > > >
> > > > > In C it would look something along the lines of
> > > > > http://www.mail-archive.com/sqlite-users@sqlite.org/msg31854.html
> > > > >
> > > > >
> > > >
> > > > It looks like this is not supported for select statements, though I
> > > > could easily be wrong. In this particular case, you could rewrite
> > > > your select as:
> > > >
> > > > "SELECT * from table1 where chr in (1,2)"
> > > >
> > > > Sean
> > > >
> > > >
> > >
> > >
> > >
> >
>
>