[R] How long does skipping in read.table take

Dimitri Liakhovitski dimitri.liakhovitski at gmail.com
Sat Oct 23 17:21:20 CEST 2010


I am also running the same code on my powerful home PC.
It has been running for 25 minutes already and still has not printed
the first end time. Does that mean it is still trying to read in DF for
the first time?
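
One possible way around the textConnection memory error is to skip
textConnection altogether and let read.table pull nrows lines at a time
straight from the open file connection. The following is only a sketch
under assumptions: the demo file, its two columns, and the chunk size of
10 are made up so that it runs on its own; the real file would use
k <- 1000000.

```r
# Hypothetical demo file so the sketch is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c("a|b", paste(1:25, letters[1:25], sep = "|")), path)

k <- 10                       # rows per chunk (small for the demo)
con <- file(path, "r")

# read the header line once and keep the column names
hdgs <- strsplit(readLines(con, 1), "|", fixed = TRUE)[[1]]

chunks <- 0
total <- 0
repeat {
  # read.table continues from the connection's current position;
  # it raises an error once no lines are left, which ends the loop.
  # Note: this tryCatch also hides other read errors, so a real
  # script may want to inspect the error message.
  DF <- tryCatch(
    read.table(con, header = FALSE, sep = "|", nrows = k,
               col.names = hdgs),
    error = function(e) NULL)
  if (is.null(DF) || nrow(DF) == 0) break

  # process chunk -- here we only count rows
  chunks <- chunks + 1
  total <- total + nrow(DF)
}
close(con)
```

Because each chunk is parsed directly from the file, no copy of the raw
lines has to be held in a text connection at the same time as DF.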

On Sat, Oct 23, 2010 at 10:52 AM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> Just tried it on my work computer (Windows XP, I only have 2 GB RAM):
> I've run your code, just indicated the separator "|" in read.table (in
> DF line) and added the actual processing (writing out of the result
> with a file name) - see below.
> I got:
> Error in textConnection(x) : cannot allocate memory for text connection
>
> Thanks again for helping!
> Dimitri
>
> ### New code from Gabor:
> k <- 1000000 # no of rows per chunk
> first <- TRUE
> con <- file('myfile.txt', "r")
> count<-1
>
> repeat {
>
>  start<-Sys.time()
>  print(start)
>  flush.console()
>
>  # skip header
>  if (first) hdgs <- readLines(con, 1)
>  first <- FALSE
>
>  x <- readLines(con, k)
>  if (length(x) == 0) break
>  DF <- read.table(textConnection(x), header = FALSE,sep="|")
>
>  # process chunk -- we just print last row here
>  end<-Sys.time()
>  print(end-start)
>  print(names(DF))
>  print(tail(DF, 1))
>  flush.console()
>  filename<-paste("Chunk of 1 Mil number ",count,".txt",sep="")
>  write.table(DF,sep="\t",col.names=FALSE,file=filename)
>  count<-count+1
> }
> close(con)
>
>
> On Sat, Oct 23, 2010 at 10:19 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>> On Sat, Oct 23, 2010 at 10:07 AM, Dimitri Liakhovitski
>> <dimitri.liakhovitski at gmail.com> wrote:
>>> I just tried it:
>>>
>>> for(i in 11:16){ #i<-11
>>>  start<-Sys.time()
>>>  print(start)
>>>  flush.console()
>>>  filename<-paste("skipped millions- ",i,".txt",sep="")
>>>  mydata<-read.csv.sql("myfile.txt", sep="|", eol="\r\n", sql =
>>> "select * from file limit 1000000, (1000000*i-1)")
>>
>> The SQL statement does not know anything about R variables. You would
>> need something like this:
>>
>>> i <- 1
>>> s <- sprintf("select * from file limit 10, %d", 10*i-1)
>>> s
>> [1] "select * from file limit 10, 9"
>>> read.csv.sql(..., sql = s, ...)
>>
>> Also if you just want to read it in as chunks reading from a
>> connection in R would be sufficient:
>>
>> k <- 5000 # no of rows per chunk
>> first <- TRUE
>> con <- file('myfile.csv', "r")
>> repeat {
>>
>>   # skip header
>>   if (first) hdgs <- readLines(con, 1)
>>   first <- FALSE
>>
>>   x <- readLines(con, k)
>>   if (length(x) == 0) break
>>   DF <- read.csv(textConnection(x), header = FALSE)
>>
>>   # process chunk -- we just print last row here
>>   print(tail(DF, 1))
>>
>> }
>> close(con)
>>
>>
>> --
>> Statistics & Software Consulting
>> GKX Group, GKX Associates Inc.
>> tel: 1-877-GKX-GROUP
>> email: ggrothendieck at gmail.com
>>
>
>
>
> --
> Dimitri Liakhovitski
> Ninah Consulting
> www.ninah.com
>
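
On the point above that the SQL statement cannot see R variables: the
query string can be built in R for each chunk before it is passed on.
The sketch below assumes the SQLite form "limit <offset>, <count>" (my
understanding of what read.csv.sql runs underneath); chunk_query is a
hypothetical helper name, not part of sqldf.

```r
# Build the query for chunk i (1-based) of k rows each.
# Assumes SQLite semantics: "limit <rows to skip>, <rows to return>".
chunk_query <- function(i, k) {
  offset <- as.integer((i - 1) * k)   # rows before chunk i
  sprintf("select * from file limit %d, %d", offset, as.integer(k))
}

queries <- vapply(11:16, chunk_query, character(1), k = 1000000)

# Each element could then be passed as the sql= argument, e.g.
# read.csv.sql("myfile.txt", sep = "|", sql = queries[1])
```

Computing the offset in R keeps the arithmetic out of the SQL string, so
the statement contains only literal numbers.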



-- 
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com


