[R] How to read data sequentially into R (line by line)?

johannes rara johannesraja at gmail.com
Tue Oct 18 14:57:29 CEST 2011


Thanks Jim,

I tried to adapt this solution to my situation (a .txt file as input):

zz <- file("myfile.txt", "r")

fileNo <- 1  # used for file name
buffer <- NULL
repeat{
   input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
row.names=NULL, na.strings="")
   if (length(input) == 0) break  # done
   buffer <- c(buffer, input)
   # find separator
   repeat{
       indx <- which(grepl("^GG!KK!KK!", buffer))[1]
       if (is.na(indx)) break  # not found yet; read more
       writeLines(buffer[1:(indx - 1L)]
           , sprintf("newFile%04d.txt", fileNo)
           )
       buffer <- buffer[-c(1:indx)]  # remove data
       fileNo <- fileNo + 1
   }
}

but it gives me an error

Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  no lines available in input
>

Do you know a reason for this?
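
A likely reason, sketched under assumptions: read.csv() does not return
an empty result once the connection is exhausted. It stops with the
error shown above, so the length(input) == 0 test is never reached.
readLines() returns character(0) at the end of the input instead. The
three-line textConnection below is a hypothetical stand-in for
myfile.txt.

```r
# Hypothetical three-line input standing in for myfile.txt
con <- textConnection(c("APE!KKU!684!", "APE!VAL!!", "GG!KK!KK!"))

first  <- readLines(con, n = 10)  # consumes all three lines
at_eof <- readLines(con, n = 10)  # end of input: character(0), no error

# read.csv() on the same exhausted connection raises an error instead,
# so it has to be caught rather than tested with length()
msg <- tryCatch({
    read.csv(con, sep = "!", header = FALSE)
    "no error"
}, error = function(e) conditionMessage(e))

close(con)
```

Note also that read.csv() returns a data frame, so length(input) is the
number of columns and the "^GG!KK!KK!" pattern has no raw line to match.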

-J
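
A minimal sketch of the same loop with readLines() in place of
read.csv(), so that the buffer holds raw lines which the separator
pattern can match. The temporary input file, the output directory and
n = 5 are assumptions made only so that the example runs on its own;
for the real data one would open "myfile.txt" and use a much larger n.

```r
# Hypothetical small input written to a temp file; stands in for myfile.txt
infile <- tempfile(fileext = ".txt")
record <- c("APE!KKU!684!", "APE!VAL!!", "APE!STAP!1!", "GG!KK!KK!")
writeLines(rep(record, 3), infile)

# Hypothetical output directory for the per-record files
outdir <- tempfile("chunks")
dir.create(outdir)

zz <- file(infile, "r")
fileNo <- 1L
buffer <- character(0)
repeat {
    input <- readLines(zz, n = 5)   # raw lines, not parsed columns
    if (length(input) == 0) break   # character(0) at end of file
    buffer <- c(buffer, input)
    repeat {
        indx <- grep("^GG!KK!KK!", buffer)[1]
        if (is.na(indx)) break      # no complete record buffered yet
        # seq_len() also behaves when the separator is the first line
        writeLines(buffer[seq_len(indx - 1L)],
                   file.path(outdir, sprintf("newFile%04d.txt", fileNo)))
        buffer <- buffer[-seq_len(indx)]  # drop record and separator
        fileNo <- fileNo + 1L
    }
}
close(zz)
```

Because the inner loop only writes up to a separator line, a record is
never split across two files, whatever value of n is used. Lines left in
buffer after the loop would belong to an unterminated last record.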

2011/10/18 jim holtman <jholtman at gmail.com>:
> Let's do it in two parts: first create all the separate files (if
> this is what you are after, we can stop here).  You can change the
> value of n in readLines to read in as many lines as you want; I set
> it to 100 just for testing.
>
> x <- textConnection("APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!
> APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!
> APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!")
>
> fileNo <- 1  # used for file name
> buffer <- NULL
> repeat{
>    input <- readLines(x, n = 100)
>    if (length(input) == 0) break  # done
>    buffer <- c(buffer, input)
>    # find separator
>    repeat{
>        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>        if (is.na(indx)) break  # not found yet; read more
>        writeLines(buffer[seq_len(indx - 1L)]  # safe when indx == 1
>            , sprintf("newFile%04d", fileNo)
>            )
>        buffer <- buffer[-seq_len(indx)]  # remove record and separator
>        fileNo <- fileNo + 1
>    }
> }
>
>
> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesraja at gmail.com> wrote:
>> I have a data set like this in one .txt file (cols separated by !):
>>
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>>
>> It contains over 14 000 000 records. Because I run out of memory
>> when trying to handle this data in R, I'm trying to read it
>> sequentially, write it out to several .csv files (or .RData files),
>> and then read these into R one by one. One record in this data lies
>> between two GG!KK!KK! lines. I tried to implement Jim Holtman's
>> approach
>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html), but the
>> problem is how to avoid cutting a record in the middle. I mean that
>> if I set nrows = 1000000, I don't know whether one record (between
>> two GG!KK!KK! marks) will end up split across two files. How can I
>> avoid that? My code so far:
>>
>> zz <- file("myfile.txt", "r")
>> fileNo <- 1
>> repeat{
>>
>>    gotError <- 1 # set to 2 if there is an error
>>    # catch the error if no more data
>>    tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>                               row.names=NULL, na.strings="", header=FALSE),
>>              error=function(x) gotError <<- 2)
>>
>>    if (gotError == 2) break
>>    # save the intermediate data
>>    save(input, file=sprintf("file%03d.RData", fileNo))
>>    fileNo <- fileNo + 1
>> }
>> close(zz)
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
>


