[R-sig-DB] concurrent reading/writing in "chunks" with RSQLite (need some help troubleshooting)

kMan kchamberln using gmail.com
Tue Jul 6 08:19:52 CEST 2010


Dear Jonathan,

I do not know of a quick way to query the number of records the first time;
after that, it should be a simpler matter. I suggest looking into "CREATE
INDEX", issued through dbGetQuery(). With clever use of indexes you may be
able to avoid creating duplicate data in some cases.

I have not actually used it for this particular case yet, so I do not have a
tested, working example, but a rough sketch of the idea is below.
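
Untested sketch (it assumes the TABLEA / row_names layout from the example
further down this thread):
--------------------------------------------------------------------------
require(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "test.sqlite")

## Build the index once; chunked reads that filter on row_names can then
## seek through the index instead of scanning the whole table.
dbGetQuery(con, "CREATE INDEX IF NOT EXISTS idx_tablea_rownames
                 ON TABLEA (row_names)")

## e.g. one chunk of three rows starting at row_names 4
chunk <- dbGetQuery(con, "SELECT * FROM TABLEA WHERE row_names >= 4 LIMIT 3")

dbDisconnect(con)
--------------------------------------------------------------------------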

Sincerely,
KeithC.


-----Original Message-----
From: jgrn307 using gmail.com [mailto:jgrn307 using gmail.com] On Behalf Of Jonathan
Greenberg
Sent: Monday, July 05, 2010 1:37 PM
To: kMan
Cc: Seth Falcon; r-sig-db using stat.math.ethz.ch
Subject: Re: concurrent reading/writing in "chunks" with RSQLite (need some
help troubleshooting)

Keith and Seth (and r-sig-db'ers):

I think I understand the difference between your two solutions -- Seth, I'm
not a database person (quickly learning, though), but does your code hold the
output in memory before writing it all at the end (via dbCommit), or is each
iteration's output stored in some "waiting room" on disk and written at the
very end?  The key (for my purposes) is to avoid keeping more than a chunk's
worth of input AND output in memory at any one time.

Along the lines of Keith's solution, I noticed that the COUNT query can be
VERY slow -- is there really no rapid way of getting the number of rows in a
database table?

Thanks again!

--j

On Tue, Jun 29, 2010 at 9:50 PM, kMan <kchamberln using gmail.com> wrote:
> Hi Seth,
>
> This particular list requires considerable stretching on my part, so I
> would not be surprised if I missed a goal implied by the code style used or
> off-list correspondence.
>
> If I understood the original code correctly, though, each iteration
> through the loop required (2) queries, a read and a write, where (I've
> assumed) the result sets are handled one at a time anyway. The code I
> proposed does the same thing, but overall there is (1) additional query
> prior to the loop to get the number of rows. The rest is just case
> processing b/c dbGetQuery() seemed to do funny things when LIMIT extended
> beyond the end of the table, and it wouldn't remember its last read location
> (e.g. didn't work like scan()). In addition, it works with dbWriteTable()
> and gets around the need for creating the table first, and filling it in
> later.
>
> I used the WHERE condition for lack of a better way to seek to a
> particular record. The fetch() & fetch_n, were those intended to get the
> read queries to start at the last read location (e.g. like serial reads
> using scan())? Eliminating the need to condition-match or generate a db
> index first would speed things up, and I'd very much like to learn that
> trick. I'd think it would also make for an excellent R-wiki post.
>
> Sincerely,
> KeithC.
>
> -----Original Message-----
> From: Seth Falcon [mailto:seth using userprimary.net]
> Sent: Tuesday, June 29, 2010 9:37 PM
> To: kMan
> Cc: Jonathan Greenberg; r-sig-db using stat.math.ethz.ch
> Subject: Re: R-sig-DB Digest, Vol 68, Issue 7
>
> On Tue, Jun 29, 2010 at 8:19 PM, kMan <kchamberln using gmail.com> wrote:
>> Dear Jonathan,
>>
>>> I'm trying to cycle through a database, reading a chunk of data from one
>>> table in the DB, processing this data chunk, and then writing the
>>> output to another table in the database.  I'm having to set the
>>> fetch(n=) option because the database is too large to read the
>>> entire thing into memory at once.  Here's my sample code:
>> [snip]
>>
>> There is no need to open separate connections for reading vs. writing.
>
> While it is true that you can read and write from a single connection, 
> you can only have one result set open at a time and so if you want to 
> iterate through a result set and then write, you do, I believe, need 
> two connections.
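>
> In sketch form (untested, and the table names are only placeholders;
> whether the interleaved writes succeed still depends on SQLite's file
> locking, which seems to be part of the trouble in this thread):
>
> library(RSQLite)
> drv <- dbDriver("SQLite")
> con.read  <- dbConnect(drv, dbname = "test.sqlite")  # holds the open result set
> con.write <- dbConnect(drv, dbname = "test.sqlite")  # used only for appends
>
> res <- dbSendQuery(con.read, "SELECT * FROM TABLEA")
> while (!dbHasCompleted(res)) {
>   chunk <- fetch(res, n = 1000)   # keep only one chunk in memory
>   ## ... process chunk here ...
>   dbWriteTable(con.write, "TABLEB", chunk, append = TRUE, row.names = FALSE)
> }
> dbClearResult(res)
> dbDisconnect(con.read)
> dbDisconnect(con.write)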
>
>
> I suspect that the code you propose will be considerably slower as you 
> are issuing multiple queries along with a WHERE constraint.
>
> + seth
>
>
>>
>> Your code was corrupting the connection between R/driver, I think, 
>> and I don't know why. But when I close a connection in R and my OS 
>> won't let me delete the file, there is a different kind of problem. 
>> The code below works on my system. I wish I knew enough to explain wtf
>> happened with your code.
>> --------------------------------------------------------------------------
>> ##
>> ## Prelim/Declarations
>> ##
>> require(RSQLite)
>> mysqldb <- "test.sqlite"
>> m <- dbDriver("SQLite")
>> tbl.names <- c("TABLEA", "TABLEB")
>> q1 <- paste("SELECT COUNT(*) FROM", tbl.names[1])
>>
>> ##
>> ## Surrogate data (if not already done)
>> ##
>> (somedata <- data.frame(data1 = 1:10, data2 = (1:10) * 3))
>> con <- dbConnect(m, dbname = mysqldb)
>> dbWriteTable(con, "TABLEA", somedata, overwrite = TRUE)
>> dbDisconnect(con)
>>
>> ##
>> ## Process TABLEA in parts, write to TABLEB
>> ##
>> con <- dbConnect(m, dbname = mysqldb)
>> n.lines <- dbGetQuery(con, q1)  # NOTE: class is data.frame
>> chunk.size <- 3
>> start.index <- seq(from = 1, to = n.lines[1, 1], by = chunk.size)
>> limit <- c(diff(start.index), n.lines[1, 1] - sum(diff(start.index)))
>> for (i in 1:length(limit)) {
>>   if (i < length(limit)) {
>>     q2 <- paste("SELECT * FROM TABLEA WHERE row_names >=",
>>                 start.index[i], "LIMIT", limit[i])
>>   } else {
>>     q2 <- paste("SELECT * FROM TABLEA WHERE row_names =",
>>                 start.index[i], "LIMIT", limit[i])
>>   }
>>   dat <- dbGetQuery(con, q2)
>>   dbWriteTable(con, tbl.names[2], dat, row.names = FALSE, append = TRUE)
>> }
>> dbGetQuery(con, "SELECT * FROM TABLEB")
>> dbDisconnect(con)
>> --------------------------------------------------------------------------
>>
>> Sincerely,
>> KeithC.
>>
>>
>>
>
>
>
> --
> Seth Falcon | @sfalcon | http://userprimary.net/
>
>



