[R] How to read data sequentially into R (line by line)?
johannes rara
johannesraja at gmail.com
Tue Oct 18 20:06:12 CEST 2011
Thank you, Jim, for your kind reply. My intention was to split one 14M-line
file into at most 15 text files, each of them having ~1M lines. The
idea was to make sure that one "sequence"
GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end
does not get split between two files, so that we don't end up with,
e.g., the end of the first file (containing ~1M lines) looking like
...
GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
--no sequence end here!
and the beginning of the second file
--no sequence start here!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end
...
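
Adapting your earlier code, I guess the trick is to scan each chunk
from the end for the LAST break instead of the first one. A rough,
untested sketch of what I have in mind (file and object names are just
placeholders):

zz <- file("myfile.txt", "r")
fileNo <- 1
buffer <- NULL
repeat{
    input <- readLines(zz, n = 1000000)
    if (length(input) == 0) break              # end of file
    buffer <- c(buffer, input)
    # scan from the back: find the last break so no sequence is split
    brk <- which(grepl("^GG!KK!KK!", buffer))
    if (length(brk) == 0) next                 # no break yet; read more
    last <- brk[length(brk)]
    writeLines(buffer[1:last], sprintf("newFile%04d.txt", fileNo))
    buffer <- buffer[-(1:last)]                # keep the partial sequence
    fileNo <- fileNo + 1
}
# flush whatever is left after the last full chunk
if (length(buffer) > 0) writeLines(buffer, sprintf("newFile%04d.txt", fileNo))
close(zz)
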
-J
2011/10/18 jim holtman <jholtman at gmail.com>:
> I thought that you wanted a separate file for each of the breaks
> "GG!KK!KK!". If you want to read in some large number of lines and
> then split so that each file has about that many lines, you can do
> the same thing, except scanning from the back for a break. So if your
> input file has 14M breaks in it, then the code I sent would create
> that many files. If you want a minimum number of lines per file,
> including the breaks, then it can be done. You just have to be
> clearer on exactly what the requirements are. From your sample data,
> it looks like there were 7 text lines per record, so if your input
> was 14M lines, I would expect that you would have something in the
> neighborhood of 1.8M files with 7 lines each. If you had 14M lines in
> the file and you were generating 14M files, then something is wrong
> with your code: it is not recognizing the breaks. How many lines did
> each file have in it?
>
> On Tue, Oct 18, 2011 at 9:36 AM, johannes rara <johannesraja at gmail.com> wrote:
>> Thanks, Jim, for your help. I tried this code using readLines and it
>> works, but not in the way I wanted. It seems that this code separates
>> every record in the text file into its own file, so I'm getting over
>> 14 000 000 text files. My intention is to get only 15 text files, all
>> except one containing 1 000 000 rows, so that the record sitting at
>> the breakpoint (near the 1 000 000th line) is not cut in the middle...
>>
>> -J
>>
>> 2011/10/18 jim holtman <jholtman at gmail.com>:
>>> Use 'readLines' instead of 'read.table'. We want to read in the text
>>> file and convert it into separate text files, each of which can then
>>> be read in using 'read.table'. My solution assumes that you have used
>>> readLines. Trying to do this with data frames gets messy. Keep it
>>> simple and do it in two phases; that makes it easier to debug and to
>>> see what is going on.
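>>>
>>> For example, once the split files are written, each one can be read
>>> back with something like this (the exact arguments will depend on
>>> your data):
>>>
>>> df <- read.table("newFile0001", sep = "!", header = FALSE,
>>>                  as.is = TRUE, na.strings = "")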
>>>
>>>
>>>
>>> On Tue, Oct 18, 2011 at 8:57 AM, johannes rara <johannesraja at gmail.com> wrote:
>>>> Thanks Jim,
>>>>
>>>> I tried to adapt this solution to my situation (a .txt file as input):
>>>>
>>>> zz <- file("myfile.txt", "r")
>>>>
>>>> fileNo <- 1  # used for file name
>>>> buffer <- NULL
>>>> repeat{
>>>>     input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>                       row.names=NULL, na.strings="")
>>>>     if (length(input) == 0) break  # done
>>>>     buffer <- c(buffer, input)
>>>>     # find separator
>>>>     repeat{
>>>>         indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>>         if (is.na(indx)) break  # not found yet; read more
>>>>         writeLines(buffer[1:(indx - 1L)],
>>>>                    sprintf("newFile%04d.txt", fileNo))
>>>>         buffer <- buffer[-c(1:indx)]  # remove data
>>>>         fileNo <- fileNo + 1
>>>>     }
>>>> }
>>>>
>>>> but it gives me an error
>>>>
>>>> Error in read.table(file = file, header = header, sep = sep, quote = quote, :
>>>> no lines available in input
>>>>>
>>>>
>>>> Do you know a reason for this?
>>>>
>>>> -J
>>>>
>>>> 2011/10/18 jim holtman <jholtman at gmail.com>:
>>>>> Let's do it in two parts: first create all the separate files (and
>>>>> if this is what you are after, we can stop here). You can change
>>>>> the value passed to readLines to read in as many lines as you want;
>>>>> I set it to 100 just for testing.
>>>>>
>>>>> x <- textConnection("APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!
>>>>> APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!
>>>>> APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!")
>>>>>
>>>>> fileNo <- 1  # used for file name
>>>>> buffer <- NULL
>>>>> repeat{
>>>>>     input <- readLines(x, n = 100)
>>>>>     if (length(input) == 0) break  # done
>>>>>     buffer <- c(buffer, input)
>>>>>     # find separator
>>>>>     repeat{
>>>>>         indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>>>         if (is.na(indx)) break  # not found yet; read more
>>>>>         writeLines(buffer[1:(indx - 1L)],
>>>>>                    sprintf("newFile%04d", fileNo))
>>>>>         buffer <- buffer[-c(1:indx)]  # remove data
>>>>>         fileNo <- fileNo + 1
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesraja at gmail.com> wrote:
>>>>>> I have a data set like this in one .txt file (cols separated by !):
>>>>>>
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>>
>>>>>> it contains over 14 000 000 rows. Now, because I'm running out of
>>>>>> memory when trying to handle this data in R, I'm trying to read it
>>>>>> sequentially and write it out into several .csv files (or .RData
>>>>>> files) and then read these into R one by one. One record in this
>>>>>> data is between the lines GG!KK!KK! and GG!KK!KK!. I tried to
>>>>>> implement Jim Holtman's approach
>>>>>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html), but
>>>>>> the problem is how to avoid cutting a record in the middle. I mean
>>>>>> that if I put nrows = 1000000, I don't know whether one record
>>>>>> (between the marks GG!KK!KK! and GG!KK!KK!) ends up split across
>>>>>> two files. How can I avoid that? My code so far:
>>>>>>
>>>>>> zz <- file("myfile.txt", "r")
>>>>>> fileNo <- 1
>>>>>> repeat{
>>>>>>     gotError <- 1  # set to 2 if there is an error; catches the case of no more data
>>>>>>     tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>>>                                row.names=NULL, na.strings="", header=FALSE),
>>>>>>              error=function(x) gotError <<- 2)
>>>>>>     if (gotError == 2) break
>>>>>>     # save the intermediate data
>>>>>>     save(input, file=sprintf("file%03d.RData", fileNo))
>>>>>>     fileNo <- fileNo + 1
>>>>>> }
>>>>>> close(zz)
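>>>>>>
>>>>>> The plan then is to load the pieces back into R one at a time,
>>>>>> roughly like this:
>>>>>>
>>>>>> for (f in sprintf("file%03d.RData", 1:(fileNo - 1))) {
>>>>>>     load(f)  # restores the data frame 'input'
>>>>>>     # ... process one chunk, then move on to the next ...
>>>>>> }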
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Holtman
>>>>> Data Munger Guru
>>>>>
>>>>> What is the problem that you are trying to solve?
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
>