[R] importing data to SQLite database with sqldf
Stephen Tucker
brown_emu at yahoo.com
Fri Feb 20 05:55:33 CET 2009
Hi all,
I am attempting to learn SQL through sqldf...
One task I am particularly interested in is merging separate
(presumably large) files into a single table without loading these
files into R as an intermediate step (by loading them into SQLite and
merging them there).
Taking a step back, I've considered these alternatives:
1) I know if I use straight SQLite commands I might use the 'IMPORT'
or 'INSERT INTO' command, which is not terribly flexible... (I can
read large files line-by-line in Python and use the 'INSERT INTO'
command, which is reasonably fast; I could do this in R as well but my
experience with R's input/output is that it's much slower...? and
sometimes setting up the table column definitions can be tedious if
there are many variables).
2) dbWriteTable() with append=TRUE is very convenient except that it
requires I load the data into R first...
3) sqldf's capability to put data directly into a database is
something I'd like to work out.
So in this case I have a series of tab-delimited text file with the
first line being a header.
For some reason I cannot seem to get it working. Combining examples 6
and 9 from the Google Code page (and R-help archives), I tried
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
(do I need it for SQLite?)
##
sqldf("attach 'mydb.db' as new")
f <- file("myexample.txt")
attr(f,"file.format") <- list(header=TRUE,sep="\t")
sqldf("create table myexample as select * from f",
stringsAsFactors=FALSE,
dbname="mydb.db")
## or
f <- file(fi)
sqldf("create table myexample as select * from f",
stringsAsFactors=FALSE,file.format=list(header=TRUE,sep="\t"),
dbname="mydb.db")
##
sqldf("select * from myexample",dbname="mydb.db")
gives me tables with 0 rows and 0 columns...
So in any case I have a few questions:
=== 1 ====
Would this be scalable to files with few GBs of data in them (I guess
I am uncertain of the underlying mechanism for transporting data from
the .txt file to the .db file... I see there is a call the
dbWriteTable() internally in sqldf but through the connection)? And
is there anything obviously doing wrong above?
=== 2 ===
Since I cannot 'append' rows to existing tables in SQLite (or any SQL
database), I think what I would have to do is to load each of the
files into the database and then 'rbind' them into a new table using
'union all'? The following code is what I have (the part where I read
in each file before dumping into the database file would hopefully be
replaced by the method above, if it can be made to work).
## (1) create a database file
sqldf("attach 'alltables.db' as new")
## (2) convenience function
sql <- function(...) sqldf(...,dbname="alltables.db")
## (3) load data as separate tables
for( fi in list.files(".","txt$") ) {
mydata <- read.delim(fi)
sql(sprintf("create table %s as select * from mydata",sub("\\.txt","",fi)))
}
rm(fi,mydata)
## (4) merge tables
tablenames <- as.character(sql("select * from sqlite_master")$name)
sql(paste("create table myfulltable as",
paste(sprintf("select * from %s",tablenames),
collapse=" union all ")))
## (5) delete separate tables since we have a merged one
for( nm in tablenames ) sql(sprintf("drop table %s",nm))
=== 3 ===
The following command
sqldf("attach 'mydb.db' as new")
DF <- read.delim(fi)
sqldf("create table myexample as select * from DF",dbname="mydb.db")
will usually create a .db file twice the size of the .txt file (for
now I am playing with a files ~500KB so the .db files are around
~1MB). When I create a .db file using SQLite's 'import' command,
RSQLite's dbWriteTable(), or inserting values from the same .txt file
from Python's SQLite interface, I get .db files that are approximately
the same size as the .txt file (~500KB). Is the larger file size for
sqldf's method expected?
Many thanks in advance!
Stephen Tucker
More information about the R-help
mailing list