[R] How to Read a Large CSV into a Database with R

Mon Nov 15 18:24:18 CET 2010

On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico <ajdamico at gmail.com> wrote:
> Hi Gabor,
>
> Thank you for the prompt reply.  I definitely looked over all of the
> examples on the code.google.com sqldf page before sending, which is why I
> wrote the code
>
> read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select * from
> file" , dbname="sqlite")
>
> directly pulled from their code -- read.csv.sql("~/tmp.csv", sql = "create
> table mytab as select * from file",dbname = "mydb")
>
> ..but I don't understand why this helps me around the memory problem, since
> I think it still all gets read into memory.  Is there a way to do this line
> by line?

OK. Maybe its something else.

The reading in of the file into the database should not be a resource
problem provided you have enough disk space and appropriate
permissions. sqldf / RSQLite are used to get sqlite to do it so that
the data never goes through R at that stage so R limitations can't
affect the reading in to the sqlite database.   When you read it from
the sqlite database then R limitations come into effect so you just
have to be sure not to read too much in at a time.  The use of create
table ... as select ... is to prevent sqldf from deleting the table
since sqldf is normally used in a fashion where you don't want to know
about the back end databases so it tries to create them and delete
them behind the scenes but here you want to explicitly use them so
you have to work around that.

Try this example. It should be reproducible so you just have to copy
it and paste it into your R session.  Uncomment the indicated line if
you want to be able to remove any pre-existing mydb file in the
current directory.  Try it in a fresh R session just to be sure that
nothing mucks it up.

library(sqldf)

# uncomment next line to make sure we are starting clean
# if (file.exists("mydb")) file.remove("mydb")

# create new database
sqldf("attach 'mydb' as new")

# create a new file.  BOD is built into R and has 6 rows.
write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")

# read new file into database
read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
	dbname = "mydb")

# how many records are in table?
N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]

# read in chunks and display what we have read

k <- 4 # no of records to read at once
for(i in seq(0, N-1, k)) {
	s <- sprintf("select * from mytab limit %d, %d", i, k)
	print(sqldf(s, dbname = "mydb"))
}

On my machine I get this output:

  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
  Time demand
1    5   15.6
2    7   19.8

showing that it read the 6 line BOD data frame in chunks of 4 as required.
-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com