[R] How to read data sequentially into R (line by line)?
johannes rara
johannesraja at gmail.com
Tue Oct 18 20:06:12 CEST 2011
Thank you, Jim, for your kind reply. My intention was to split one 14M-line
file into at most 15 text files, each of them having ~1M lines. The
idea was to make sure that one "sequence"
GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end
does not get split between two files, so that we don't end up with,
e.g., the end of the first file (containing ~1M lines) looking like
...
GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
--no sequence end here!
and the beginning of the second file
--no sequence start here!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end
...
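
Adapting your earlier code, I guess the trick is to scan each chunk
from the end for the LAST break instead of the first one. A rough,
untested sketch of what I have in mind (file and object names are just
placeholders):

zz <- file("myfile.txt", "r")
fileNo <- 1
buffer <- NULL
repeat{
    input <- readLines(zz, n = 1000000)
    if (length(input) == 0) break              # end of file
    buffer <- c(buffer, input)
    # scan from the back: find the last break so no sequence is split
    brk <- which(grepl("^GG!KK!KK!", buffer))
    if (length(brk) == 0) next                 # no break yet; read more
    last <- brk[length(brk)]
    writeLines(buffer[1:last], sprintf("newFile%04d.txt", fileNo))
    buffer <- buffer[-(1:last)]                # keep the partial sequence
    fileNo <- fileNo + 1
}
# flush whatever is left after the last full chunk
if (length(buffer) > 0) writeLines(buffer, sprintf("newFile%04d.txt", fileNo))
close(zz)
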
-J
2011/10/18 jim holtman <jholtman at gmail.com>:
> I thought that you wanted a separate file for each of the breaks
> "GG!KK!KK!". If you want to read in some large number of lines and
> then split so that each file has about that many lines, you can do
> the same thing, except scanning from the back for a break. So if your
> input file has 14M breaks in it, then the code I sent would create
> that many files. If you want a minimum number of lines per file,
> including the breaks, then it can be done. You just have to be
> clearer on exactly what the requirements are. From your sample data,
> it looks like there were 7 text lines per record, so if your input
> was 14M lines, I would expect that you would have something in the
> neighborhood of 1.8M files with 7 lines each. If you had 14M lines in
> the file and you were generating 14M files, then something is wrong
> with your code: it is not recognizing the breaks. How many lines did
> each file have in it?
>
> On Tue, Oct 18, 2011 at 9:36 AM, johannes rara <johannesraja at gmail.com> wrote:
>> Thanks, Jim, for your help. I tried this code using readLines and it
>> works, but not in the way I wanted. It seems that this code separates
>> every record in the text file into its own file, so I'm getting over
>> 14 000 000 text files. My intention is to get only 15 text files, all
>> except one containing 1 000 000 rows, so that the record sitting at
>> the breakpoint (near the 1 000 000th line) is not cut in the middle...
>>
>> -J
>>
>> 2011/10/18 jim holtman <jholtman at gmail.com>:
>>> Use 'readLines' instead of 'read.table'. We want to read in the text
>>> file and convert it into separate text files, each of which can then
>>> be read in using 'read.table'. My solution assumes that you have used
>>> readLines. Trying to do this with data frames gets messy. Keep it
>>> simple and do it in two phases; that makes it easier to debug and to
>>> see what is going on.
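>>>
>>> For example, once the split files are written, each one can be read
>>> back with something like this (the exact arguments will depend on
>>> your data):
>>>
>>> df <- read.table("newFile0001", sep = "!", header = FALSE,
>>>                  as.is = TRUE, na.strings = "")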
>>>
>>>
>>>
>>> On Tue, Oct 18, 2011 at 8:57 AM, johannes rara <johannesraja at gmail.com> wrote:
>>>> Thanks Jim,
>>>>
>>>> I tried to adapt this solution to my situation (a .txt file as input):
>>>>
>>>> zz <- file("myfile.txt", "r")
>>>>
>>>> fileNo <- 1  # used for file name
>>>> buffer <- NULL
>>>> repeat{
>>>>     input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>                       row.names=NULL, na.strings="")
>>>>     if (length(input) == 0) break  # done
>>>>     buffer <- c(buffer, input)
>>>>     # find separator
>>>>     repeat{
>>>>         indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>>         if (is.na(indx)) break  # not found yet; read more
>>>>         writeLines(buffer[1:(indx - 1L)],
>>>>                    sprintf("newFile%04d.txt", fileNo))
>>>>         buffer <- buffer[-c(1:indx)]  # remove data
>>>>         fileNo <- fileNo + 1
>>>>     }
>>>> }
>>>>
>>>> but it gives me an error
>>>>
>>>> Error in read.table(file = file, header = header, sep = sep, quote = quote, :
>>>> no lines available in input
>>>>>
>>>>
>>>> Do you know a reason for this?
>>>>
>>>> -J
>>>>
>>>> 2011/10/18 jim holtman <jholtman at gmail.com>:
>>>>> Let's do it in two parts: first create all the separate files (and
>>>>> if this is what you are after, we can stop here). You can change
>>>>> the value passed to readLines to read in as many lines as you want;
>>>>> I set it to 100 just for testing.
>>>>>
>>>>> x <- textConnection("APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!
>>>>> APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!
>>>>> APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!")
>>>>>
>>>>> fileNo <- 1  # used for file name
>>>>> buffer <- NULL
>>>>> repeat{
>>>>>     input <- readLines(x, n = 100)
>>>>>     if (length(input) == 0) break  # done
>>>>>     buffer <- c(buffer, input)
>>>>>     # find separator
>>>>>     repeat{
>>>>>         indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>>>         if (is.na(indx)) break  # not found yet; read more
>>>>>         writeLines(buffer[1:(indx - 1L)],
>>>>>                    sprintf("newFile%04d", fileNo))
>>>>>         buffer <- buffer[-c(1:indx)]  # remove data
>>>>>         fileNo <- fileNo + 1
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesraja at gmail.com> wrote:
>>>>>> I have a data set like this in one .txt file (cols separated by !):
>>>>>>
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>>
>>>>>> it contains over 14 000 000 rows. Now, because I'm running out of
>>>>>> memory when trying to handle this data in R, I'm trying to read it
>>>>>> sequentially and write it out into several .csv files (or .RData
>>>>>> files) and then read these into R one by one. One record in this
>>>>>> data is between the lines GG!KK!KK! and GG!KK!KK!. I tried to
>>>>>> implement Jim Holtman's approach
>>>>>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html), but
>>>>>> the problem is how to avoid cutting a record in the middle. I mean
>>>>>> that if I put nrows = 1000000, I don't know whether one record
>>>>>> (between the marks GG!KK!KK! and GG!KK!KK!) ends up split across
>>>>>> two files. How can I avoid that? My code so far:
>>>>>>
>>>>>> zz <- file("myfile.txt", "r")
>>>>>> fileNo <- 1
>>>>>> repeat{
>>>>>>     gotError <- 1  # set to 2 if there is an error; catches the case of no more data
>>>>>>     tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>>>                                row.names=NULL, na.strings="", header=FALSE),
>>>>>>              error=function(x) gotError <<- 2)
>>>>>>     if (gotError == 2) break
>>>>>>     # save the intermediate data
>>>>>>     save(input, file=sprintf("file%03d.RData", fileNo))
>>>>>>     fileNo <- fileNo + 1
>>>>>> }
>>>>>> close(zz)
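>>>>>>
>>>>>> The plan then is to load the pieces back into R one at a time,
>>>>>> roughly like this:
>>>>>>
>>>>>> for (f in sprintf("file%03d.RData", 1:(fileNo - 1))) {
>>>>>>     load(f)  # restores the data frame 'input'
>>>>>>     # ... process one chunk, then move on to the next ...
>>>>>> }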
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Holtman
>>>>> Data Munger Guru
>>>>>
>>>>> What is the problem that you are trying to solve?
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
>