[R] How long does skipping in read.table take

Sat Oct 23 16:31:46 CEST 2010

Gabor, thanks a lot. So, I don't really need sql? That's great.
I'll try your code.
To finish with sql, I've run this:
(I wanted to skip the first 11 million rows)

 mydata<-read.csv.sql("my.file.txt", sep="|", eol="\r\n", sql =
"select * from file limit 1000000, 10999999")

After 20 min on a 4-core, 64-bit Windows 7 PC with 6 GB of RAM (of which
I assume only 4 GB can be used?), I got this error:
Error: cannot allocate vector of size 42.0 Mb
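
A note on the query above, as a hedged sketch rather than a diagnosis: in SQLite the comma form "limit A, B" is read as offset A, count B. As written, "limit 1000000, 10999999" therefore skips 1 million rows and asks for about 11 million, which is presumably the reverse of what was intended. The helper name chunk_query below is hypothetical (it is not part of sqldf); it only builds the query string.

```r
# Hypothetical helper (not part of sqldf): build the query for one chunk.
# In SQLite, "limit <offset>, <count>" skips <offset> rows and returns <count>.
chunk_query <- function(skip, n) {
  # %.0f prints large whole numbers in full, without scientific notation
  sprintf("select * from file limit %.0f, %.0f", skip, n)
}

# Skip the first 11 million rows, then read the next 1 million
s <- chunk_query(11e6, 1e6)
print(s)
```

The resulting string would then be passed as the sql argument, e.g. read.csv.sql("my.file.txt", sep = "|", eol = "\r\n", sql = s).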

So, I guess

On Sat, Oct 23, 2010 at 10:19 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Sat, Oct 23, 2010 at 10:07 AM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
>> I just tried it:
>>
>> for(i in 11:16){ #i<-11
>>  start<-Sys.time()
>>  print(start)
>>  flush.console()
>>  filename<-paste("skipped millions- ",i,".txt",sep="")
>>  mydata<-read.csv.sql("myfilel.txt", sep="|", eol="\r\n", sql =
>> "select * from file limit 1000000, (1000000*i-1)")
>
> The SQL statement does not know anything about R variables. You would
> need something like this:
>
>> i <- 1
>> s <- sprintf("select * from file limit 10, %d", 10*i-1)
>> s
> [1] "select * from file limit 10, 9"
>> read.csv.sql(..., sql = s, ...)
>
> Also if you just want to read it in as chunks reading from a
> connection in R would be sufficient:
>
> k <- 5000 # no of rows per chunk
> first <- TRUE
> con <- file('myfile.csv', "r")
> repeat {
>
>   # skip header
>   if (first) hdgs <- readLines(con, 1)
>   first <- FALSE
>
>   x <- readLines(con, k)
>   if (length(x) == 0) break
>   DF <- read.csv(textConnection(x), header = FALSE)
>
>   # process chunk -- we just print last row here
>   print(tail(DF, 1))
>
> }
> close(con)
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
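
The connection approach quoted above can also be adapted to skip a given number of rows before reading, without sql. The following is a minimal sketch under stated assumptions: the function name read_rows, its arguments, and the block size are hypothetical, the first line of the file is assumed to be a header, and the demonstration uses a small temporary file rather than the real data.

```r
# Hypothetical helper: read `n` data rows after skipping `skip` data rows,
# using a plain connection. Assumes the first line of the file is a header.
read_rows <- function(path, skip, n, sep = "|", block = 100000) {
  con <- file(path, "r")
  on.exit(close(con))
  hdgs <- readLines(con, 1)                 # header line
  # Skip in blocks so the skipped lines never sit in memory all at once
  while (skip > 0) {
    got <- length(readLines(con, min(skip, block)))
    if (got == 0) break                     # file is shorter than `skip`
    skip <- skip - got
  }
  x <- readLines(con, n)
  if (length(x) == 0) return(NULL)          # nothing left to read
  read.table(text = c(hdgs, x), sep = sep, header = TRUE,
             quote = "\"", comment.char = "", stringsAsFactors = FALSE)
}

# Small demonstration on a temporary pipe-delimited file
tmp <- tempfile(fileext = ".txt")
writeLines(c("id|value", paste(1:20, (1:20) * 10, sep = "|")), tmp)
DF <- read_rows(tmp, skip = 15, n = 3)
print(DF)
```

Skipping this way still reads every skipped line from disk, so it is not instantaneous on a very large file, but only one block of lines is held in memory at a time.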

-- 
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com