[R-sig-DB] dbSendPreparedQuery in RSQLite

James Bullard bullard at berkeley.edu
Thu Apr 17 20:04:43 CEST 2008


Thank you Sean, this worked great. Now the cost of going to the
database pales in comparison to the split I need to do afterwards to
divide the returned data.frame into a list with one element per
region. This is unfortunate, because the data.frame returned to R is
already sorted in region order. For example, res looks like:

  chr location count region
1   1       10     1      1
2   1      290    10      1
3   2      245     3      2
4   3       10     2      3



So, to return the object as a list, I do:

split(res, res$region)

This splitting dominates the cost of the entire function.
Unfortunately, split() does not exploit the fact that the data are
already sorted by the factor. I think I will handle this by asking the
database for the number of rows in each region and then using that
vector to partition the data, as sketched below (unless of course
anyone has a smarter idea).
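
Along those lines, a minimal sketch done purely in R (hypothetical; it
skips the database round trip entirely, since the run lengths of the
sorted region column give the same per-region row counts):

# res$region is sorted, so run lengths give the partition
# boundaries directly, avoiding split()'s per-row grouping.
runs   <- rle(res$region)
ends   <- cumsum(runs$lengths)
starts <- c(1L, head(ends, -1L) + 1L)
byRegion <- mapply(function(s, e) res[s:e, , drop = FALSE],
                   starts, ends, SIMPLIFY = FALSE)
names(byRegion) <- runs$values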

Thanks again. I really appreciate the detailed example; it helped
tremendously in sorting out the indexing.

jim

On Apr 16, 2008, at 12:21 PM, Sean Davis wrote:
> On Wed, Apr 16, 2008 at 1:35 PM, James Bullard  
> <bullard at berkeley.edu> wrote:
>> Okay, well, I thought that what I was doing was not easily done with
>> one query, but since I am wrong more often than not I'll ask and see
>> if anyone can think of a way to combine it into one query.
>>
>> I have a set of regions:
>>
>> regions <- cbind(chr = c(1,2,3), start = c(1,230,4950),
>>                  end = c(293, 320, 19200))
>>
>> In the db I have the following table:
>>
>> CREATE TABLE count_table (chr, location, count);
>>
>> chr location count
>>   1       10     1
>>   1      290    10
>>   2      245     3
>>   3       10     2
>>
>> The queries that I am running are:
>>
>> apply(regions, 1, function(region) {
>>     dbGetQuery(con, paste("SELECT * FROM count_table WHERE chr =",
>>                           region[1], "AND location >=", region[2],
>>                           "AND location <=", region[3]))
>> })
>>
>> Which gives me back the following (which is exactly what I want):
>> [[1]]
>>  chr location count__1
>> 1   1       10        1
>> 2   1      290       10
>>
>> [[2]]
>>  chr location count__1
>> 1   2      245        3
>>
>> [[3]]
>> NULL data frame with 0 rows
>>
>>
>> So I don't see any obvious way to pass in multiple ranges, but maybe
>> I'm missing something with the IN operator.  Thanks in advance, and
>> sorry for the long email.
>
>> counts <- data.frame(chr=c(1,1,2,3), location=c(10,290,245,10),
>>                      count=c(1,10,3,2))
>> dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
>> dbGetQuery(con,'select * from count_table')
>  chr location count__1
> 1   1       10        1
> 2   1      290       10
> 3   2      245        3
> 4   3       10        2
>
>> regions <- data.frame(chr = c(1,2,3), start = c(1,230,4950),
>>                       end = c(293, 320, 19200))
>> dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE)
>> dbGetQuery(con,'select * from regions')
>  chr start end__1
> 1   1     1    293
> 2   2   230    320
> 3   3  4950  19200
>
>
>> dbGetQuery(con,'select count_table.* from count_table join regions
>>            on regions.chr=count_table.chr and count_table.location
>>            between regions.start and regions.end__1')
>  chr location count__1
> 1   1       10        1
> 2   1      290       10
> 3   2      245        3
>
> So, the principle of a temp table works.  Let's try some benchmarking.
> Create a random count table of 100k rows, write it to the database,
> and create some indexes to speed up querying.
>
>
>> counts <- data.frame(chr=sample(1:22,100000,replace=TRUE),
>>                      location=sample(1:100000000,100000,replace=TRUE),
>>                      count=sample(1:10,100000,replace=TRUE))
>> counts[1:5,]
>  chr location count
> 1  14 91737974     5
> 2   5 61059218     2
> 3   2 23944824     4
> 4   3 82907389     9
> 5  22 94658940     1
>> dbWriteTable(con,'count_table',counts,row.names=FALSE,overwrite=TRUE)
> [1] TRUE
>> dbGetQuery(con,'create index ct_chr on count_table(chr)')
> NULL
>> dbGetQuery(con,'create index ct_location on count_table(location)')
> NULL
>
> So, to try a benchmark, I wrote a little function that creates n
> random regions (see the function below).  In the temp table approach,
> the data.frame of regions is written to a table in the database and a
> join is used.  In the second approach, apply() issues one query per
> region.  The function returns the system.time for each approach.  In
> my quick benchmarking, the temp table approach is about 5 times
> faster (including the time to write the temp table) with a
> count_table of 100k rows.  I would expect the difference to grow much
> larger as count_table increases in size, since the indexes on
> count_table must be consulted once per region in the apply() approach
> but only once in the temp table/join approach.
>
> benchRegions <- function(n) {
>   # n is the number of regions.
>   # returns a list with 2 sets of system times, one for
>   #  the temp table approach, and one for the apply approach
>   res <- list()
>   starts=sample(1:100000000,n,replace=TRUE)
>   ends=starts+sample(1:1000,n,replace=TRUE)
>   regions <- data.frame(chr=sample(1:22,n,replace=TRUE),
>                         start=starts,
>                         end=ends)
>   res[['tmptab']] <- system.time({
>     dbWriteTable(con,'regions',regions,row.names=FALSE,overwrite=TRUE)
>     dbGetQuery(con,'select count_table.* from count_table join regions
>                     on regions.chr=count_table.chr and
>                     count_table.location between regions.start and
>                     regions.end__1')
>   })
>   res[['apply']] <- system.time(apply(regions, 1, function(region) {
>     dbGetQuery(con, paste("SELECT * FROM count_table WHERE chr =",
>                           region[1], "AND location >=", region[2],
>                           "AND location <=", region[3]))
>   }))
>   return(res)
> }
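
A hypothetical invocation (assuming con is an open RSQLite connection
and count_table has been written and indexed as in the session above):

library(RSQLite)
con <- dbConnect(SQLite(), dbname = ":memory:")
## write count_table and create the two indexes as above, then:
times <- benchRegions(100)
times[['tmptab']]  # timing for the temp table/join approach
times[['apply']]   # timing for the query-per-region approach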
>
> Hope that helps.
>
> Sean
>
>
>> On Apr 16, 2008, at 8:47 AM, Sean Davis wrote:
>>
>>> On Tue, Apr 15, 2008 at 11:45 PM, James Bullard
>>> <bullard at berkeley.edu> wrote:
>>>
>>>> Yes, in this example that would certainly work. However, I have a
>>>> large number of queries in a tight loop and it would be nice to
>>>> avoid the overhead of the query parsing and compiling on each call
>>>> through the loop.
>>>>
>>>
>>> A couple of typical tricks for avoiding loops are to either:
>>>
>>> 1) use an "in" query (sketched below), or
>>> 2) use a temporary table (CREATE TEMPORARY TABLE .... or
>>> dbWriteTable()) and then do a join.
>>>
>>> If you have hundreds or thousands of queries, it is quite possible
>>> that #2 could be much faster.  Benchmarking would be useful here,
>>> but it is definitely worth a try if looping is too slow.
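
For illustration, trick #1 might look like the sketch below (an
assumption-laden sketch only; it restricts by chromosome, not by each
region's location bounds, so it fits cases where the set of key values
alone defines the result):

# Build a comma-separated list of the chromosomes of interest and
# fetch all matching rows in a single query.
chrs <- paste(unique(regions[, "chr"]), collapse = ",")
dbGetQuery(con, sprintf("SELECT * FROM count_table WHERE chr IN (%s)",
                        chrs))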
>>>
>>> Sean
>>>
>>>
>>>> thanks again, jim
>>>>
>>>>
>>>>
>>>>
>>>> On Apr 15, 2008, at 7:21 PM, Sean Davis wrote:
>>>>
>>>>
>>>>> On Tue, Apr 15, 2008 at 8:31 PM, James Bullard
>>>>> <bullard at berkeley.edu> wrote:
>>>>>
>>>>>
>>>>>> How do we send prepared queries in RSQLite?  We are interested in
>>>>>> something along the lines of:
>>>>>>
>>>>>> ps <- dbSendPreparedQuery(conn, "SELECT * FROM table1 WHERE chr = ?",
>>>>>>                           data.frame("integer"))
>>>>>> for (i in 1:2)
>>>>>>   rs <- dbGetPreparedQuery(ps, i)
>>>>>>
>>>>>> In C it would look something along the lines of
>>>>>> http://www.mail-archive.com/sqlite-users@sqlite.org/msg31854.html
>>>>>>
>>>>>>
>>>>>
>>>>> It looks like this is not supported for SELECT statements, though
>>>>> I could easily be wrong.  In this particular case, you could
>>>>> rewrite your select as:
>>>>>
>>>>> "SELECT * from table1 where chr in (1,2)"
>>>>>
>>>>> Sean



