[R] sqldf for Very Large Tab Delimited Files

Gabor Grothendieck ggrothendieck at gmail.com
Thu Feb 2 09:11:56 CET 2012


On Wed, Feb 1, 2012 at 11:57 PM, HC <hcatbr at yahoo.co.in> wrote:
> Hi All,
>
> I have a very (very) large tab-delimited text file without headers. There
> are only 8 columns and millions of rows. I want to make numerous pieces of
> this file by sub-setting it for individual stations. Station is given as in
> the first column. I am trying to learn and use sqldf package for this but am
> stuck in a couple of places.
>
> To simulate my requirement, I have taken iris dataset as an example and have
> done the following:
> (1) create a tab-delimited file without headers.
> (2) read it using read.csv.sql command
> (3) write the result of a query, getting first 10 records
>
> Here is the reproducible code that I am trying:
> # Text data file
> write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE,
> col.names=FALSE,row.names = FALSE)
> # create an empty database (can skip this step if database already exists)
> sqldf("attach myTestdbT as new")
> f1<-file("irisNoH.txt")
> attr(f1, "file.format") <- list(header=FALSE,sep="\t")
> # read into table called irisTab in the mytestdb sqlite database
> read.csv.sql("irisNoH.txt", sql = "create table main.irisTab1 as select *
> from file", dbname = "mytestdb")
> res1<-sqldf("select * from main.irisTab1 limit 10", dbname = "mytestdb")
> write.table(res1, "iris10.txt", sep = "\t", quote = FALSE,
> col.names=FALSE,row.names = FALSE)
>
> # For querying records of a particular species - unresolved problems
> #a1<-"virginica"
> #attr(f1, "names") <- c("A1","A2","A3","A4","A5")
> #res2<-fn$sqldf("select * from main.irisTab1 where A5 = '$a1'")
>
> In the above, I am not able to:
> (1) assign the names to various columns
> (2) query for particular value of a column; in this case for particular
> species, say virginica
> (3) I guess fn$sqldf can do the job but it requires assigning column names
>
> Any help would be most appreciated.
>

Ignoring your iris file for a moment, to query the 5th column (getting
its name via sql rather than via R) we can do this:

library(sqldf)
species <- "virginica"
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")

Now, sqldf is best used when you are getting the data from R but if
you want to store it in a database and just leave it there then you
might be better off using RSQLite directly like this (the eol = "\r\n"
in the dbWriteTable statement was needed on my Windows system but you
may not need that depending on your platform):


write.table(iris, "irisNoH.txt", sep = "\t", quote = FALSE, col.names
= FALSE, row.names = FALSE)

library(sqldf)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mytestdb")

dbWriteTable(con, "iris", "irisNoH.txt", sep = "\t", eol = "\r\n")

species <- "virginica"
nms <- names(dbGetQuery(con, "select * from iris limit 0"))
fn$dbGetQuery(con, "select * from iris where `nms[5]` = '$species' limit 3")

dbDisconnect(con)


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list