[R] How to read data sequentially into R (line by line)?
johannes rara
johannesraja at gmail.com
Tue Oct 18 15:36:27 CEST 2011
Thanks Jim for your help. I tried this code using readLines and it
works, but not in the way I wanted. It seems that the code separates
every record into its own file, so I'm getting over 14 000 000 text
files. My intention is to get only 15 text files, all except one
containing 1 000 000 rows, so that the record at the breakpoint (near
line 1 000 000) is not cut in the middle.
-J
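The split described above can be sketched as follows: read fixed-size chunks with readLines, but write out only whole records, i.e. lines up to and including the last GG!KK!KK! separator buffered so far, carrying any partial record over into the next chunk. This is a minimal sketch, untested at scale; the demo file, chunk size, and output names are placeholders standing in for the real 14-million-record data:

```r
# Tiny demo input standing in for the real file (4 records of 3 lines each).
writeLines(rep(c("APE!KKU!684!", "APE!VAL!!", "GG!KK!KK!"), 4), "demo.txt")

con       <- file("demo.txt", "r")
max_lines <- 5            # would be ~1000000 for the real data
fileNo    <- 1
carry     <- character(0) # partial record left over from the last chunk
repeat {
  chunk <- readLines(con, n = max_lines)
  if (length(chunk) == 0) break
  buffer <- c(carry, chunk)
  sep <- which(grepl("^GG!KK!KK!", buffer))
  if (length(sep) == 0) { carry <- buffer; next } # no boundary yet; read on
  cut <- sep[length(sep)] # last complete record boundary in this buffer
  writeLines(buffer[1:cut], sprintf("newFile%04d.txt", fileNo))
  carry  <- buffer[-(1:cut)]
  fileNo <- fileNo + 1
}
if (length(carry) > 0) # flush any trailing partial record
  writeLines(carry, sprintf("newFile%04d.txt", fileNo))
close(con)
```

Each output file then ends on a GG!KK!KK! line, so no record is ever split across files.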
2011/10/18 jim holtman <jholtman at gmail.com>:
> Use 'readLines' instead of 'read.table'. We want to read in the text
> file and convert it into separate text files, each of which can then
> be read in using 'read.table'. My solution assumes that you have used
> readLines. Trying to do this with data frames gets messy. Keep it
> simple and do it in two phases; makes it easier to debug and to see
> what is going on.
>
>
>
> On Tue, Oct 18, 2011 at 8:57 AM, johannes rara <johannesraja at gmail.com> wrote:
>> Thanks Jim,
>>
>> I tried to convert this solution into my situation (.txt file as an input);
>>
>> zz <- file("myfile.txt", "r")
>>
>> fileNo <- 1  # used for file name
>> buffer <- NULL
>> repeat{
>>   input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>                     row.names=NULL, na.strings="")
>>   if (length(input) == 0) break  # done
>>   buffer <- c(buffer, input)
>>   # find separator
>>   repeat{
>>     indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>     if (is.na(indx)) break  # not found yet; read more
>>     writeLines(buffer[1:(indx - 1L)],
>>                sprintf("newFile%04d.txt", fileNo))
>>     buffer <- buffer[-c(1:indx)]  # remove data
>>     fileNo <- fileNo + 1
>>   }
>> }
>>
>> but it gives me an error
>>
>> Error in read.table(file = file, header = header, sep = sep, quote = quote, :
>> no lines available in input
>>
>> Do you know a reason for this?
>>
>> -J
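The error comes from read.csv(): at end of file it throws "no lines available in input" instead of returning an empty result, and it returns a data frame (whose length() is the number of columns, never 0 on a successful read), so the character-vector buffer logic never applies. Replacing it with readLines, as in Jim's original, avoids both problems. A minimal sketch with a small stand-in file and made-up output names:

```r
# Small stand-in for myfile.txt: 3 one-line records plus separators.
writeLines(rep(c("APE!KKU!684!", "GG!KK!KK!"), 3), "myfile_demo.txt")

zz     <- file("myfile_demo.txt", "r")
fileNo <- 1
buffer <- NULL
repeat {
  input <- readLines(zz, n = 4)      # readLines returns character(0) at EOF
  if (length(input) == 0) break
  buffer <- c(buffer, input)
  repeat {
    indx <- which(grepl("^GG!KK!KK!", buffer))[1]
    if (is.na(indx)) break           # no complete record buffered yet
    writeLines(buffer[1:(indx - 1L)], sprintf("outFile%04d.txt", fileNo))
    buffer <- buffer[-(1:indx)]      # drop the written record + separator
    fileNo <- fileNo + 1
  }
}
close(zz)
```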
>>
>> 2011/10/18 jim holtman <jholtman at gmail.com>:
>>> Let's do it in two parts: first create all the separate files (if
>>> that is what you are after, we can stop here). You can change the
>>> value passed to readLines to read in as many lines as you want; I set
>>> it to 2 just for testing.
>>>
>>> x <- textConnection("APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!
>>> APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!
>>> APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!")
>>>
>>> fileNo <- 1  # used for file name
>>> buffer <- NULL
>>> repeat{
>>>   input <- readLines(x, n = 100)
>>>   if (length(input) == 0) break  # done
>>>   buffer <- c(buffer, input)
>>>   # find separator
>>>   repeat{
>>>     indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>     if (is.na(indx)) break  # not found yet; read more
>>>     writeLines(buffer[1:(indx - 1L)],
>>>                sprintf("newFile%04d", fileNo))
>>>     buffer <- buffer[-c(1:indx)]  # remove data
>>>     fileNo <- fileNo + 1
>>>   }
>>> }
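Phase two can then be a plain read.table() per piece. A sketch with a small stand-in file; the fill = TRUE precaution against ragged rows is an assumption, not something from the thread:

```r
# Stand-in for one of the split files produced in phase one.
writeLines(c("APE!KKU!684!", "APE!VAL!!", "APE!PLA!1!"), "newFile0001.txt")

# '!'-separated, no header; the trailing '!' yields an empty last field,
# which na.strings = "" turns into NA.
piece <- read.table("newFile0001.txt", sep = "!", header = FALSE,
                    as.is = TRUE, fill = TRUE, na.strings = "")
nrow(piece)  # 3 rows, one per input line
```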
>>>
>>>
>>> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesraja at gmail.com> wrote:
>>>> I have a data set like this in one .txt file (cols separated by !):
>>>>
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>>
>>>> it contains over 14 000 000 records. Because I run out of memory
>>>> when trying to handle this data in R, I'm trying to read it
>>>> sequentially, write it out as several .csv files (or .RData files),
>>>> and then read those into R one by one. One record in this data lies
>>>> between two GG!KK!KK! lines. I tried to implement Jim Holtman's
>>>> approach
>>>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html) but the
>>>> problem is how to avoid cutting a record in the middle: if I set
>>>> nrows = 1000000, a record (between the GG!KK!KK! marks) may end up
>>>> split across two files. How can I avoid that? My code so far:
>>>>
>>>> zz <- file("myfile.txt", "r")
>>>> fileNo <- 1
>>>> repeat{
>>>>   gotError <- 1  # set to 2 if there is an error
>>>>   # catch the error if there is no more data
>>>>   tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>                              row.names=NULL, na.strings="", header=FALSE),
>>>>            error=function(x) gotError <<- 2)
>>>>   if (gotError == 2) break
>>>>   # save the intermediate data
>>>>   save(input, file=sprintf("file%03d.RData", fileNo))
>>>>   fileNo <- fileNo + 1
>>>> }
>>>> close(zz)
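The risk described above is easy to demonstrate: the sample records are a fixed number of lines long, so any chunk size that is not a multiple of the record length ends mid-record. A tiny illustration (file name and sizes are made up):

```r
# Two 4-line records; a chunk of 5 lines must end inside the second record.
writeLines(rep(c("APE!KKU!684!", "APE!VAL!!", "APE!PLA!1!", "GG!KK!KK!"), 2),
           "cut_demo.txt")
con <- file("cut_demo.txt", "r")
chunk <- readLines(con, n = 5)  # 5 is not a multiple of the record length
close(con)
tail(chunk, 1)  # "APE!KKU!684!" -- the chunk ends inside the second record
```

This is why the split point has to be chosen by scanning for the GG!KK!KK! separator rather than by a fixed nrows count.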
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
>