[R-sig-DB] concurrent reading/writing in "chunks" with RSQLite (need some help troubleshooting)

Jonathan Greenberg greenberg at ucdavis.edu
Mon Jul 5 21:36:52 CEST 2010


Keith and Seth (and r-sig-db'ers):

I somewhat understand the difference between your two solutions. Seth,
I'm not a database person (though I'm quickly learning) -- does your code
hold all of the output in memory and write it at the end (via dbCommit),
or is each iteration's output stored in some on-disk "waiting room" and
only committed at the very end?  The key, for my purposes, is to avoid
keeping more than one chunk's worth of input AND output in memory at any
one time.
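
To make sure I'm following, here is the pattern as I currently understand
it (just a sketch, not your actual code, and reusing Keith's table names
for illustration): one connection streams the input with fetch(n=), a
second connection appends each processed chunk to the output table, so
only one chunk of input and one chunk of output sit in R at a time.  My
(possibly wrong) assumption is that if the writes are additionally wrapped
in a transaction, SQLite spools the uncommitted rows to the database and
journal files on disk rather than holding them in R, so a single dbCommit()
at the end wouldn't blow up memory -- is that right?
------------------------------------------------------------------------
require(RSQLite)
m <- dbDriver("SQLite")
read.con  <- dbConnect(m, dbname = "test.sqlite")
write.con <- dbConnect(m, dbname = "test.sqlite")

res <- dbSendQuery(read.con, "SELECT * FROM TABLEA")
while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 3)      # only this chunk of input is in memory
  if (nrow(chunk) == 0) break
  out <- chunk                    # ...per-chunk processing would go here...
  dbWriteTable(write.con, "TABLEB", out,
               row.names = FALSE, append = TRUE)
}
dbClearResult(res)
dbDisconnect(read.con)
dbDisconnect(write.con)
------------------------------------------------------------------------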

Along the lines of Keith's solution: I noticed the COUNT(*) query can be
VERY slow -- is there really no fast way to get the number of rows in a
database table?
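
The only shortcut I've stumbled on so far is MAX(rowid), which seems to
come back essentially instantly because SQLite can answer it with a single
B-tree lookup instead of a full scan -- but as far as I can tell it only
equals the true row count when no rows have ever been deleted from the
table.  Something like this (again just a sketch, using Keith's table name):
------------------------------------------------------------------------
require(RSQLite)
con <- dbConnect(dbDriver("SQLite"), dbname = "test.sqlite")
## fast, but only exact if no rows were ever deleted from TABLEA
n.fast  <- dbGetQuery(con, "SELECT MAX(rowid) AS n FROM TABLEA")$n
## always exact, but scans the whole table
n.exact <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM TABLEA")$n
dbDisconnect(con)
------------------------------------------------------------------------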

Thanks again!

--j

On Tue, Jun 29, 2010 at 9:50 PM, kMan <kchamberln at gmail.com> wrote:
> Hi Seth,
>
> This particular list requires considerable stretching on my part, so I would not be surprised if I missed a goal implied by the code style used or off-list correspondence.
>
> If I understood the original code correctly, though, each iteration through the loop required two queries, a read and a write, where (I've assumed) the result sets are handled one at a time anyway. The code I proposed does the same thing, but adds one query before the loop to get the number of rows. The rest is just case processing, because dbGetQuery() seemed to do funny things when LIMIT extended beyond the end of the table, and it wouldn't remember its last read location (i.e. it didn't work like scan()). In addition, it works with dbWriteTable() and avoids having to create the table first and fill it in later.
>
> I used the WHERE condition for lack of a better way to seek to a particular record. The fetch() and fetch(n=) calls -- were those intended to get the read queries to start at the last read location (i.e. like serial reads using scan())? Eliminating the need to match on a condition, or to build a database index first, would speed things up, and I'd very much like to learn that trick. I'd think it would also make for an excellent R-wiki post.
>
> Sincerely,
> KeithC.
>
> -----Original Message-----
> From: Seth Falcon [mailto:seth at userprimary.net]
> Sent: Tuesday, June 29, 2010 9:37 PM
> To: kMan
> Cc: Jonathan Greenberg; r-sig-db at stat.math.ethz.ch
> Subject: Re: R-sig-DB Digest, Vol 68, Issue 7
>
> On Tue, Jun 29, 2010 at 8:19 PM, kMan <kchamberln at gmail.com> wrote:
>> Dear Jonathan,
>>
>>> I'm trying to cycle through a database, reading a chunk of data from one
>>> table in the DB, processing this data chunk, and then writing the output to
>>> another table in the database.  I'm having to set the fetch(n=) option
>>> because the database is too large to read the entire thing into memory at
>>> once.  Here's my sample code:
>> [snip]
>>
>> There is no need to open separate connections for reading vs. writing.
>
> While it is true that you can read and write from a single connection,
> you can only have one result set open at a time and so if you want to
> iterate through a result set and then write, you do, I believe, need
> two connections.
>
>
> I suspect that the code you propose will be considerably slower as you
> are issuing multiple queries along with a WHERE constraint.
>
> + seth
>
>
>>
>> Your code was corrupting the connection between R and the driver, I think,
>> and I don't know why. But when I close a connection in R and my OS won't
>> let me delete the file, there is a different kind of problem. The code below
>> works on my system. I wish I knew enough to explain wtf happened with your code.
>> ------------------------------------------------------------------------
>> ##
>> ## Prelim/Declarations
>> ##
>> require(RSQLite)
>> mysqldb <- "test.sqlite"
>> m <- dbDriver("SQLite")
>> tbl.names <- c("TABLEA", "TABLEB")
>> q1<-paste("SELECT COUNT(*) FROM", tbl.names[1])
>>
>> ##
>> ## Surrogate data (if not already done)
>> ##
>> (somedata <- data.frame(data1=1:10, data2=(1:10)*3))
>> con<-dbConnect(m, dbname=mysqldb)
>> dbWriteTable(con,"TABLEA",somedata, overwrite=TRUE)
>> dbDisconnect(con)
>>
>> ##
>> ## Process TABLEA in parts, write to TABLEB
>> ##
>> con<-dbConnect(m, dbname=mysqldb)
>> n.lines<-dbGetQuery(con, q1) #NOTE class is data.frame
>> chunk.size<-3
>> start.index<-seq(from = 1, to = n.lines[1,1], by = chunk.size)
>> limit<-c(diff(start.index), n.lines[1,1]-sum(diff(start.index)))
>> for(i in 1:length(limit)){
>>   if(i<length(limit)) {
>>     q2<-paste("SELECT * FROM TABLEA WHERE row_names >=",
>>               start.index[i], "LIMIT", limit[i])
>>   } else {
>>     q2<-paste("SELECT * FROM TABLEA WHERE row_names =",
>>               start.index[i], "LIMIT", limit[i])
>>   }
>>   dat<-dbGetQuery(con, q2)
>>   dbWriteTable(con, tbl.names[2], dat, row.names=FALSE, append=TRUE)
>> }
>> dbGetQuery(con, "SELECT * FROM TABLEB")
>> dbDisconnect(con)
>> ------------------------------------------------------------------------
>>
>> Sincerely,
>> KeithC.
>>
>
> --
> Seth Falcon | @sfalcon | http://userprimary.net/
>
>



