[R] How to Read a Large CSV into a Database with R
Seth Falcon
seth at userprimary.net
Tue Nov 16 05:53:10 CET 2010
Hi Anthony,
On Mon, Nov 15, 2010 at 7:07 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi, I'm working in R 2.11.1 x64 on Windows x86_64-pc-mingw32. I'm trying to
> insert a very large CSV file into a SQLite database. I'm pretty new to
> working with databases in R, so I apologize if I'm overlooking something
> obvious here.
Working with large data and doing operations in bounded memory tends
not to be among the most obvious things to accomplish in R for
newcomers.
> I need to combine both of these CSV files into one table (within a
> database), so I think that I'd need a SQL manipulation technique that reads
> everything line by line, instead of pulling it all into memory.
Yes, that sounds like the right approach. Below is some sample code
that does this and avoids reading the entire data set into memory.
The key points are to use read.table on a file connection and to read
in a batch of lines at a time. Each batch is inserted into the db
using a prepared query. The example is complicated a bit because
read.table on a file connection raises an error if no lines are
available, so I chose to use tryCatch to handle that. A cleaner
approach might be to check the number of rows read and break out of
the while loop when the count is less than the batch size; a sketch of
that variant follows the example.
Anyhow, see if this approach works for you.
library("RSQLite")

file_list <- c("ss09pusa.csv", "ss09pusb.csv")

## Open the first file on a connection so it can be read in chunks; the
## second file can be appended to the same table with the same loop.
input <- file(file_list[1], "r")
db <- dbConnect(SQLite(), dbname = "example.sqlite")

## Use the header row to build the CREATE TABLE statement; every column
## is declared TEXT and read as character to keep the example simple.
header <- readLines(input, n = 1)
fields <- strsplit(header, ",")[[1]]
colTypes <- rep("TEXT", length(fields))
colDecl <- paste(fields, colTypes)
sql <- sprintf("CREATE TABLE ss09 (%s)", paste(colDecl, collapse = ", "))
dbGetQuery(db, sql)

colClasses <- rep("character", length(fields))

## Parameterized INSERT used for each batch of rows.
sql.in <- sprintf("INSERT INTO ss09 VALUES (%s)",
                  paste(rep("?", length(fields)), collapse = ","))

chunk_size <- 250000

## Do all of the inserts inside a single transaction for speed.
dbBeginTransaction(db)
tryCatch({
    while (TRUE) {
        ## Read the next chunk of rows; read.table raises an error once
        ## the connection is exhausted, which ends the loop below.
        part <- read.table(input, nrows = chunk_size, sep = ",",
                           colClasses = colClasses,
                           comment.char = "")
        dbGetPreparedQuery(db, sql.in, bind.data = part)
    }
}, error = function(e) {
    ## read.table signals "no lines available" at end of input; treat
    ## that as normal termination and re-raise anything else.
    if (grepl("no lines available", conditionMessage(e)))
        TRUE
    else
        stop(conditionMessage(e))
})
dbCommit(db)
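
For what it's worth, the loading loop could instead be written along
the lines of the sketch below. It assumes the table and the variables
db, file_list, chunk_size, colClasses and sql.in have been set up as
above, loops over both files (skipping each file's header row), and
leaves the loop once read.table returns fewer rows than the batch
size. I have not tested it, so treat it as an outline rather than a
drop-in replacement.

## Sketch only, reusing db, file_list, chunk_size, colClasses and
## sql.in from the example above.  Note that if a file holds an exact
## multiple of chunk_size rows, the final read.table call will still
## raise "no lines available"; the tryCatch from the example guards
## that corner case.
for (f in file_list) {
    con <- file(f, "r")
    readLines(con, n = 1)                  # skip each file's header row
    dbBeginTransaction(db)
    repeat {
        part <- read.table(con, nrows = chunk_size, sep = ",",
                           colClasses = colClasses, comment.char = "")
        dbGetPreparedQuery(db, sql.in, bind.data = part)
        if (nrow(part) < chunk_size)       # short read => end of file
            break
    }
    dbCommit(db)
    close(con)
}
dbDisconnect(db)

A quick sanity check afterwards would be something like
dbGetQuery(db, "SELECT COUNT(*) FROM ss09") before disconnecting.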
--
Seth Falcon | @sfalcon | http://userprimary.net/