[R] How long does skipping in read.table take
Dimitri Liakhovitski
dimitri.liakhovitski at gmail.com
Sat Oct 23 16:52:02 CEST 2010
Just tried it on my work computer (Windows XP, only 2 GB RAM).
I ran your code, changing only two things: I specified the separator "|"
in read.table (on the DF line) and added the actual processing (writing
each chunk out to a file with its own name) - see below.
I got:
Error in textConnection(x) : cannot allocate memory for text connection
Thanks again for helping!
Dimitri
### New code from Gabor:
k <- 1000000 # no of rows per chunk
first <- TRUE
con <- file('myfile.txt', "r")
count <- 1
repeat {
  start <- Sys.time()
  print(start)
  flush.console()
  # skip header
  if (first) hdgs <- readLines(con, 1)
  first <- FALSE
  x <- readLines(con, k)
  if (length(x) == 0) break
  DF <- read.table(textConnection(x), header = FALSE, sep = "|")
  # process chunk -- we just print last row here
  end <- Sys.time()
  print(end - start)
  print(names(DF))
  print(tail(DF, 1))
  flush.console()
  # write the chunk out to its own tab-separated file
  filename <- paste("Chunk of 1 Mil number ", count, ".txt", sep = "")
  write.table(DF, sep = "\t", col.names = FALSE, file = filename)
  count <- count + 1
}
close(con)
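
For reference, a minimal sketch of the same chunked loop with a smaller chunk size and each text connection closed explicitly once its chunk has been parsed. Whether this avoids the allocation error on a 2 GB machine is an assumption and has not been tried on the real file; the temp file and its contents are made up so that the example runs on its own.

```r
# Made-up demo file standing in for 'myfile.txt': a header plus 10 rows
tmp <- tempfile(fileext = ".txt")
writeLines(c("a|b", paste(1:10, letters[1:10], sep = "|")), tmp)

k <- 4                        # rows per chunk (tiny here; assumption: a smaller
                              # k than 1000000 lowers peak memory on the real file)
con <- file(tmp, "r")
hdgs <- readLines(con, 1)     # read the header line once, outside the loop
chunk_rows <- integer(0)      # number of rows parsed in each chunk
repeat {
  x <- readLines(con, k)
  if (length(x) == 0) break
  tc <- textConnection(x)
  DF <- read.table(tc, header = FALSE, sep = "|")
  close(tc)                   # release the text connection before the next chunk
  chunk_rows <- c(chunk_rows, nrow(DF))
}
close(con)
unlink(tmp)
print(chunk_rows)
```

The only changes from the loop above are the explicit close(tc) and the smaller k; the per-chunk processing would go where chunk_rows is updated.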
On Sat, Oct 23, 2010 at 10:19 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Sat, Oct 23, 2010 at 10:07 AM, Dimitri Liakhovitski
> <dimitri.liakhovitski at gmail.com> wrote:
>> I just tried it:
>>
>> for(i in 11:16){ #i<-11
>> start<-Sys.time()
>> print(start)
>> flush.console()
>> filename<-paste("skipped millions- ",i,".txt",sep="")
>> mydata<-read.csv.sql("myfilel.txt", sep="|", eol="\r\n", sql =
>> "select * from file limit 1000000, (1000000*i-1)")
>
> The SQL statement does not know anything about R variables. You would
> need something like this:
>
>> i <- 1
>> s <- sprintf("select * from file limit 10, %d", 10*i-1)
>> s
> [1] "select * from file limit 10, 9"
>> read.csv.sql(..., sql = s, ...)
>
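
To make the point about R variables concrete, here is a small sketch that builds one SQL string per chunk with sprintf. The helper name make_chunk_sql is hypothetical, and the sketch assumes SQLite's two-argument form "limit <offset>, <count>"; each string would then be passed as the sql argument of read.csv.sql.

```r
# Hypothetical helper: SQL text for chunk i when each chunk holds k rows.
# Assumes the SQLite form "limit <offset>, <count>".
make_chunk_sql <- function(i, k) {
  offset <- as.integer((i - 1) * k)   # rows to skip before chunk i
  sprintf("select * from file limit %d, %d", offset, as.integer(k))
}

# One statement per chunk; the values of i and k are filled in on the R side
stmts <- vapply(1:3, make_chunk_sql, character(1), k = 10)
print(stmts)
```

Because the string is assembled in R before it reaches the database, the SQL never needs to know about i.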
> Also if you just want to read it in as chunks reading from a
> connection in R would be sufficient:
>
> k <- 5000 # no of rows per chunk
> first <- TRUE
> con <- file('myfile.csv', "r")
> repeat {
>
> # skip header
> if (first) hdgs <- readLines(con, 1)
> first <- FALSE
>
> x <- readLines(con, k)
> if (length(x) == 0) break
> DF <- read.csv(textConnection(x), header = FALSE)
>
> # process chunk -- we just print last row here
> print(tail(DF, 1))
>
> }
> close(con)
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
--
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com