[R-sig-DB] RSQLite and transparent compression

Grant Farnsworth
Tue Aug 6 06:35:44 CEST 2013


On Tue, Aug 6, 2013 at 12:02 AM, Kasper Daniel Hansen
<kasperdanielhansen using gmail.com> wrote:
> What do you mean by large?  You are aware you can have an in-memory version
> of a SQLite database (whether that helps depends on the size of course)?  If
> you operate on a disk based database, fast I/O helps a lot, perhaps even
> copying the database to a local drive. I don't know anything about
> compression though, but in general I have found the sqlite.org website and
> its mailing list to be super helpful.


Not outrageously large.  I'd say 10-20GB each as delimited text files.
Still, that's too large to load into RAM and work with, which is why I
use SQLite.  I get these files as gzipped delimited text, read them
about a million lines at a time using scan(), do some basic cleanup,
and append them to one big SQLite database.  When I want to use the
data, I just query the subset I need, which fits comfortably in RAM.
If the datasets were small enough, I'd just store them in an R data
file; then I wouldn't have to worry about type conversions or
variable name issues.
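
The chunked load described above could look roughly like the sketch
below.  It is only an illustration under stated assumptions: the input
is tab-delimited with a header row, readLines() plus read.table()
stand in for the scan() call mentioned above, and the function name,
file names, table name, and column name are all hypothetical.  Column
types are inferred per chunk, so a real loader would probably want to
pin them with colClasses.

```r
library(DBI)  # the RSQLite package must also be installed

# Hypothetical helper: append a gzipped, delimited text file to a
# SQLite table, reading at most `chunk_lines` lines at a time.
load_gz_to_sqlite <- function(gz_path, db_path, table,
                              chunk_lines = 1e6, sep = "\t") {
  con <- dbConnect(RSQLite::SQLite(), db_path)
  on.exit(dbDisconnect(con), add = TRUE)
  src <- gzfile(gz_path, open = "rt")   # decompresses while reading
  on.exit(close(src), add = TRUE)

  # Assumption: the first line holds the column names.
  header <- strsplit(readLines(src, n = 1), sep, fixed = TRUE)[[1]]

  repeat {
    lines <- readLines(src, n = chunk_lines)
    if (length(lines) == 0) break
    chunk <- read.table(text = lines, sep = sep, col.names = header,
                        quote = "", comment.char = "",
                        stringsAsFactors = FALSE)
    # ... basic cleanup of `chunk` would go here ...
    dbWriteTable(con, table, chunk, append = TRUE)
  }
  invisible(NULL)
}

# Example use (hypothetical file, table, and column names):
# load_gz_to_sqlite("raw_2012.txt.gz", "big.sqlite", "raw")
# con <- dbConnect(RSQLite::SQLite(), "big.sqlite")
# small <- dbGetQuery(con, "SELECT * FROM raw WHERE year = 2012")
# dbDisconnect(con)
```

Only one chunk is held in memory at a time, so the peak memory use is
set by chunk_lines rather than by the size of the file.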

I guess it just seems wasteful to have these huge files sitting around
(or to move them across networks) when the raw data arrived compressed
and I know the SQLite databases would compress nicely as well.  That's
why I'm specifically looking for a compression solution.  I'd be open
to other approaches, of course.  For example, if I used some of the
big data packages, I could imagine appending the data to a data frame
in an .rda or .rds file and then subsetting it later without ever
loading the whole thing into RAM.  But apart from the file size I'm
pretty happy with the SQLite solution.  It just seemed like
transparent compression might be available, and I was surprised to
find that it wasn't.
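
For comparison, the non-transparent fallback is to gzip the whole
database file for storage or transfer and unpack it again before
opening it.  The sketch below does that with base R connections only;
the helper names and file names are hypothetical, and this is not the
transparent compression asked about, since the database has to be
fully decompressed on disk before SQLite can use it.

```r
# Hypothetical helper: copy bytes from one connection to another in
# blocks, so the file never has to fit in RAM.
copy_stream <- function(inp, out, block = 1e7) {
  force(inp); force(out)               # open both connections up front
  on.exit({ close(inp); close(out) })
  repeat {
    buf <- readBin(inp, what = "raw", n = block)
    if (length(buf) == 0) break
    writeBin(buf, out)
  }
  invisible(NULL)
}

# Compress src to dest (default: src with ".gz" appended).
gzip_file <- function(src, dest = paste0(src, ".gz")) {
  copy_stream(file(src, "rb"), gzfile(dest, "wb"))
  invisible(dest)
}

# Decompress src to dest (default: src with a trailing ".gz" removed).
gunzip_file <- function(src, dest = sub("\\.gz$", "", src)) {
  copy_stream(gzfile(src, "rb"), file(dest, "wb"))
  invisible(dest)
}

# Example use (hypothetical file name):
# gzip_file("big.sqlite")        # writes big.sqlite.gz
# gunzip_file("big.sqlite.gz")   # restores big.sqlite
```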

By the way, speed isn't a critical issue.  It's not super
time-sensitive work and the network to my file server is plenty fast.
It just seems like I might have missed an obvious way to avoid the
disk space and transfer time that uncompressed databases cost.



