[R] How to read data sequentially into R (line by line)?

johannes rara johannesraja at gmail.com
Tue Oct 18 14:12:08 CEST 2011


I have a data set like this in one .txt file (columns separated by '!'):

APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!

It contains over 14 000 000 records. Because I run out of memory when
trying to handle the whole file in R, I'm trying to read it
sequentially, write it out as several .csv files (or .RData files),
and then read those into R one by one. One record in this data runs
from one GG!KK!KK! line to the next. I tried to implement Jim
Holtman's approach
(http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html), but the
problem is how to avoid cutting a record in the middle: if I set
nrows = 1000000, a record (everything between two GG!KK!KK! marks)
may end up split across two files. How can I avoid that? My code so
far:

zz <- file("myfile.txt", "r")
fileNo <- 1
repeat {
    gotError <- 1  # set to 2 if read.csv fails (i.e. no more data to read)
    tryCatch(input <- read.csv(zz, as.is = TRUE, nrows = 1000000, sep = '!',
                               row.names = NULL, na.strings = "",
                               header = FALSE),
             error = function(x) gotError <<- 2)
    if (gotError == 2) break

    # save the intermediate chunk
    save(input, file = sprintf("file%03d.RData", fileNo))
    fileNo <- fileNo + 1
}
close(zz)
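
One idea I'm considering (a minimal, untested sketch; the chunk size
and the part-file names are arbitrary choices of mine): do the
chunking with readLines() instead of read.csv(), and carry everything
after the last GG!KK!KK! line over into the next chunk, so each
output file holds only whole records. This assumes every record ends
with a GG!KK!KK! line.

zz <- file("myfile.txt", "r")
fileNo <- 1
leftover <- character(0)   # lines of a possibly incomplete record

repeat {
    lines <- readLines(zz, n = 1000000)
    if (length(lines) == 0) {                # end of file
        if (length(leftover) > 0)            # flush the final record, if any
            writeLines(leftover, sprintf("part%03d.txt", fileNo))
        break
    }
    lines <- c(leftover, lines)
    ends <- which(lines == "GG!KK!KK!")      # record delimiter positions
    if (length(ends) == 0) {                 # no complete record yet
        leftover <- lines
        next
    }
    lastEnd <- ends[length(ends)]
    if (lastEnd < length(lines)) {
        leftover <- lines[(lastEnd + 1):length(lines)]
    } else {
        leftover <- character(0)
    }
    # write only complete records to this part file
    writeLines(lines[1:lastEnd], sprintf("part%03d.txt", fileNo))
    fileNo <- fileNo + 1
}
close(zz)

Each partNNN.txt should then contain only complete records, so it can
be read on its own with the same options as above, e.g.
read.csv(file, sep = '!', header = FALSE, as.is = TRUE, na.strings = "").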


