[R] How long does skipping in read.table take

Sat Oct 23 16:19:58 CEST 2010

On Sat, Oct 23, 2010 at 10:07 AM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> I just tried it:
>
> for(i in 11:16){ #i<-11
>  start<-Sys.time()
>  print(start)
>  flush.console()
>  filename<-paste("skipped millions- ",i,".txt",sep="")
>  mydata<-read.csv.sql("myfilel.txt", sep="|", eol="\r\n", sql =
> "select * from file limit 1000000, (1000000*i-1)")

The SQL statement does not know anything about R variables. You would
need something like this:

> i <- 1
> s <- sprintf("select * from file limit 10, %d", 10*i-1)
> s
[1] "select * from file limit 10, 9"
> read.csv.sql(..., sql = s, ...)
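To make the same idea work inside a loop, one query string can be built per chunk. The sketch below is only an illustration: `chunk_queries` is a hypothetical helper name, and it assumes the SQLite form `limit offset, count`, where the first number is the number of rows to skip and the second is the number of rows to return. Each resulting string could then be passed as the `sql` argument of read.csv.sql.

```r
# Hypothetical helper (not part of sqldf): build one query per chunk.
# Assumes SQLite's "limit <offset>, <count>" form: skip <offset> rows,
# then return at most <count> rows.
chunk_queries <- function(chunk_size, n_chunks) {
  offsets <- (seq_len(n_chunks) - 1) * chunk_size
  # sprintf is vectorised, so this yields one string per offset
  sprintf("select * from file limit %d, %d",
          as.integer(offsets), as.integer(chunk_size))
}

qs <- chunk_queries(1000000, 3)
print(qs)
```

Building the strings up front keeps the R arithmetic out of the SQL text, which is the point made above: SQLite only ever sees the finished numbers.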

Also, if you just want to read the file in chunks, reading from a
connection in R would be sufficient:

k <- 5000 # number of rows per chunk
con <- file("myfile.csv", "r")

# read the header line once, before the loop, so it is skipped
hdgs <- readLines(con, 1)

repeat {

   x <- readLines(con, k)
   if (length(x) == 0) break
   # text = x avoids leaving an unclosed textConnection per chunk
   DF <- read.csv(text = x, header = FALSE)

   # process chunk -- we just print last row here
   print(tail(DF, 1))

}
close(con)
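
The same loop can be wrapped in a function that reuses the header for column names and collects one result per chunk. This is a minimal sketch under stated assumptions: `read_in_chunks` and its arguments are hypothetical names, the file is a plain comma-separated file with a single header line, and the demo data are made up and written to a temporary file.

```r
# Hypothetical helper: apply FUN to each chunk of k rows and collect the results.
read_in_chunks <- function(path, k, FUN) {
  con <- file(path, "r")
  on.exit(close(con))   # close the file even if FUN fails

  # Parse the header line once; scan() strips any quotes around the names.
  hdgs <- scan(text = readLines(con, 1), what = "", sep = ",", quiet = TRUE)

  out <- list()
  repeat {
    x <- readLines(con, k)        # next k lines, fewer at end of file
    if (length(x) == 0) break
    DF <- read.csv(text = x, header = FALSE, col.names = hdgs)
    out[[length(out) + 1]] <- FUN(DF)
  }
  out
}

# Demo on made-up data: 12 rows read 5 at a time.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:12, value = (1:12) * 2), tmp, row.names = FALSE)
chunk_rows <- read_in_chunks(tmp, k = 5, FUN = nrow)
unlink(tmp)
print(unlist(chunk_rows))
```

Only one chunk is held in memory at a time, so FUN should return something small (a count, a summary, a filtered subset) rather than the whole chunk.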

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com