[R] How to Read a Large CSV into a Database with R
Lee Hachadoorian
Lee.Hachadoorian+L at gmail.com
Mon Nov 15 23:08:48 CET 2010
On Mon, 15 Nov 2010 13:28:40 -0500, Anthony Damico wrote:
> Do you have any other ideas as to how I might diagnose what's going on
> here? Or, alternatively, is there some workaround that would get this giant
> CSV into a database? If you think there's a reasonable way to use the
> IMPORT command with RSQLite, that seems like it would import the fastest,
> but I don't know that it's compatible with DBI on Windows.
>
> Thanks again!
> Anthony
If you are able to successfully read in one entire 1.2 GB file at a
time, I would skip
sqldf and do this (after setting the working directory appropriately):
connSQLite = dbConnect(dbDriver("SQLite"),
dbname = "acs")
ss09pusa = read.csv("ss09pusa.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusa)
rm(ss09pusa)
ss09pusb = read.csv("ss09pusb.csv", header = TRUE)
dbWriteTable(connSQLite, "acs2009", ss09pusb, append = TRUE)
rm(ss09pusb)
#Test select of specific columns
sql = "SELECT ST, PUMA, ADJINC FROM acs2009 WHERE ST = 33"
dfIncome = dbGetQuery(connSQLite, sql)
I was *not* able to load one entire table at a time, so I was able to make
it work by combining sqldf to read in chunks
at a time and dbWriteTable from RSQLite to write each chunk to the
database. The read would then look like, for example:
ss09pusa = read.csv.sql("ss09pusa.csv",
sql = paste("SELECT * FROM file WHERE ST =", i))
where i is an iteration over the state FIPS codes. (You could just use 1
to 56, even though there's a few missing numbers in there. The searches
for nonexistent records will take time but will otherwise be harmless.)
The dbWriteTable would be the same, with every write after the first one
using append = TRUE.
Also, I assume you will want to do the read from csv / write to SQLite
*once*, then maintain the SQLite database for other sessions. We also use
large census and IPUMS data sets, and we keep everything in a Postgres
backend, which we access with RPostgreSQL. Much easier to keep
everything organized in an RDBMS than to pass
around these monstrous csv files.
More information about the R-help
mailing list