[R] How to read data sequentially into R (line by line)?

johannes rara johannesraja at gmail.com
Tue Oct 18 15:36:27 CEST 2011


Thanks Jim for your help. I tried this code using readLines and it
works, but not in the way I wanted. The code separates every record in
the text file into its own file, so I'm getting over 14 000 000 text
files. My intention is to get only 15 text files, all except one
containing 1 000 000 rows, so that the record at the breakpoint (near
line 1 000 000) is not cut in the middle...

-J
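[Editor's note: a minimal sketch of what the poster describes, combining the readLines approach below with a size threshold: accumulate lines in a buffer and only flush to a file once the buffer has reached the target size AND can be cut at a "GG!KK!KK!" separator, so no record straddles two files. The helper name `split_records`, the block size, and the output-file pattern are illustrative assumptions, not code from this thread.]

```r
## Sketch: split a line-based file into files of at least `target` lines,
## cutting only just after a separator line so records stay whole.
## `split_records`, `block`, and `out_pattern` are made-up names for
## illustration; only "GG!KK!KK!" and readLines come from the thread.
split_records <- function(con, target, out_pattern = "newFile%04d.txt",
                          sep = "GG!KK!KK!", block = 100000) {
  fileNo <- 1
  buffer <- character(0)
  flush <- function(lines) {               # write one output file
    writeLines(lines, sprintf(out_pattern, fileNo))
    fileNo <<- fileNo + 1
  }
  repeat {
    input <- readLines(con, n = block)     # read the next block of lines
    buffer <- c(buffer, input)
    if (length(input) == 0) {              # end of input: flush remainder
      if (length(buffer) > 0) flush(buffer)
      break
    }
    while (length(buffer) >= target) {
      seps <- which(buffer == sep)
      cut <- seps[seps >= target][1]       # first separator at/after target
      if (is.na(cut)) break                # none yet; read more lines
      flush(buffer[1:cut])                 # separator stays with its record
      buffer <- buffer[-(1:cut)]
    }
  }
  fileNo - 1                               # number of files written
}
```

Applied to the real file this would be something like `zz <- file("myfile.txt", "r"); split_records(zz, target = 1000000); close(zz)`. Only one buffer of roughly `target` lines is held in memory at a time, which is the point of reading sequentially.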

2011/10/18 jim holtman <jholtman at gmail.com>:
> Use 'readLines' instead of 'read.table'.  We want to read in the text
> file and convert it into separate text files, each of which can then
> be read in using 'read.table'.  My solution assumes that you have used
> readLines.  Trying to do this with data frames gets messy.  Keep it
> simple and do it in two phases; makes it easier to debug and to see
> what is going on.
>
>
>
> On Tue, Oct 18, 2011 at 8:57 AM, johannes rara <johannesraja at gmail.com> wrote:
>> Thanks Jim,
>>
>> I tried to adapt this solution to my situation (a .txt file as input):
>>
>> zz <- file("myfile.txt", "r")
>>
>> fileNo <- 1  # used for file name
>> buffer <- NULL
>> repeat{
>>   input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>                     row.names=NULL, na.strings="")
>>   if (length(input) == 0) break  # done
>>   buffer <- c(buffer, input)
>>   # find separator
>>   repeat{
>>       indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>       if (is.na(indx)) break  # not found yet; read more
>>       writeLines(buffer[1:(indx - 1L)]
>>           , sprintf("newFile%04d.txt", fileNo)
>>           )
>>       buffer <- buffer[-c(1:indx)]  # remove data
>>       fileNo <- fileNo + 1
>>   }
>> }
>>
>> but it gives me an error
>>
>> Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
>>  no lines available in input
>>
>> Do you know the reason for this?
>>
>> -J
>>
>> 2011/10/18 jim holtman <jholtman at gmail.com>:
>>> Let's do it in two parts: first create all the separate files (which,
>>> if this is what you are after, means we can stop here).  You can change the
>>> value on readLines to read in as many lines as you want; I set it to 2
>>> just for testing.
>>>
>>> x <- textConnection("APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!
>>> APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!
>>> APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!")
>>>
>>> fileNo <- 1  # used for file name
>>> buffer <- NULL
>>> repeat{
>>>    input <- readLines(x, n = 100)
>>>    if (length(input) == 0) break  # done
>>>    buffer <- c(buffer, input)
>>>    # find separator
>>>    repeat{
>>>        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>        if (is.na(indx)) break  # not found yet; read more
>>>        writeLines(buffer[1:(indx - 1L)]
>>>            , sprintf("newFile%04d", fileNo)
>>>            )
>>>        buffer <- buffer[-c(1:indx)]  # remove data
>>>        fileNo <- fileNo + 1
>>>    }
>>> }
>>>
>>>
>>> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesraja at gmail.com> wrote:
>>>> I have a data set like this in one .txt file (cols separated by !):
>>>>
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>>
>>>> it contains over 14 000 000 records. Because I run out of memory
>>>> when trying to handle this data in R, I'm trying to read it
>>>> sequentially, write it out to several .csv files (or .RData files),
>>>> and then read those into R one by one. One record in this data is
>>>> the block of lines between two GG!KK!KK! marker lines. I tried to
>>>> implement Jim Holtman's approach
>>>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html), but the
>>>> problem is how to avoid cutting a record in the middle: if I set
>>>> nrows = 1000000, I don't know whether a record (between GG!KK!KK!
>>>> marks) ends up split across two files. How can I avoid that? My code
>>>> so far:
>>>>
>>>> zz <- file("myfile.txt", "r")
>>>> fileNo <- 1
>>>> repeat{
>>>>
>>>>    gotError <- 1  # set to 2 if there is an error
>>>>    # catch the error if there is no more data
>>>>    tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>                               row.names=NULL, na.strings="", header=FALSE),
>>>>             error=function(x) gotError <<- 2)
>>>>
>>>>    if (gotError == 2) break
>>>>    # save the intermediate data
>>>>    save(input, file=sprintf("file%03d.RData", fileNo))
>>>>    fileNo <- fileNo + 1
>>>> }
>>>> close(zz)
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
>



More information about the R-help mailing list