[R] How long does skipping in read.table take

Gabor Grothendieck ggrothendieck at gmail.com
Sat Oct 23 17:22:07 CEST 2010


On Sat, Oct 23, 2010 at 10:52 AM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> Just tried it on my work computer (Windows XP, I only have 2 GB RAM):
> I've run your code, just indicated the separator "|" in read.table (in
> DF line) and added the actual processing (writing out of the result
> with a file name) - see below.
> I got:
> Error in textConnection(x) : cannot allocate memory for text connection
>
> Thanks again for helping!
> Dimitri
>
> ### New code from Gabor:
> k <- 1000000 # no of rows per chunk
> first <- TRUE
> con <- file('myfile.txt', "r")
> count<-1
>
> repeat {
>
>  start<-Sys.time()
>  print(start)
>  flush.console()
>
>  # skip header
>  if (first) hdgs <- readLines(con, 1)
>  first <- FALSE
>
>  x <- readLines(con, k)
>  if (length(x) == 0) break
>  DF <- read.table(textConnection(x), header = FALSE,sep="|")
>
>  # process chunk -- we just print last row here
>  end<-Sys.time()
>  print(end-start)
>  print(names(DV))
>  print(tail(DF, 1))
>  flush.console()
>  filename<-paste("Chunk of 1 Mil number ",count,".txt",sep="")
>  write.table(DF,sep="\t",header=FALSE,file=filename)
>  count<-count+1
> }
> close(con)
>
>

Try smaller chunks.  Presumably R cannot handle chunks that large.

Also, you could use RSQLite or sqldf to set up a database and then
read from it.  Again, don't use chunks larger than what R can handle.
Here is a self contained example that you can copy and paste into an R
session.  It works on my Windows system but you might need to change
the eol if you are working on a different platform.  Reading the file
into the database is the slowest part but once its there the rest
should be reasonably fast.  Again be sure not to read such large
chunks at a time that R cannot handle them.

library(sqldf)

## create test file
numStr <- as.character(1:25)
DF <- data.frame(a = 1:25, 101:125)
write.table(DF, file = "myfile.csv", quote = FALSE, sep = ",",
	row.names = FALSE)

## define connection with attributes
myfile <- file("myfile.csv")
attr(myfile, "file.format") <- list(header = TRUE, sep = ",", eol = "\r\n")

## create new sqlite database
sqldf("attach 'mydb' as new")

## read file into mytab table of mydb database
sqldf("create table mytab as select * from myfile", dbname = "mydb")

## check that its there
sqldf("select * from sqlite_master", dbname = "mydb")
sqldf("select count(*) from mytab", dbname = "mydb")

# Read in 5 lines after skipping 10 rows.
sqldf("select * from mytab limit 5 offset 10", dbname = "mydb")

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list