[R] How to read data sequentially into R (line by line)?
johannes rara
johannesraja at gmail.com
Tue Oct 18 14:57:29 CEST 2011
Thanks Jim,
I tried to adapt this solution to my situation (a .txt file as input):
zz <- file("myfile.txt", "r")
fileNo <- 1            # used for file name
buffer <- NULL
repeat{
    input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
                      row.names=NULL, na.strings="")
    if (length(input) == 0) break       # done
    buffer <- c(buffer, input)
    # find separator
    repeat{
        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
        if (is.na(indx)) break          # not found yet; read more
        writeLines(buffer[1:(indx - 1L)],
                   sprintf("newFile%04d.txt", fileNo))
        buffer <- buffer[-c(1:indx)]    # remove data
        fileNo <- fileNo + 1
    }
}
but it gives me an error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
>
Do you know a reason for this?
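
In case it is the better route, I guess a direct adaptation of your readLines() loop to the file connection would look roughly like this (untested sketch; it assumes the same file name, chunk size, and GG!KK!KK! separator as above):

zz <- file("myfile.txt", "r")
fileNo <- 1                                # used for file name
buffer <- NULL
repeat{
    input <- readLines(zz, n = 1000000)    # read the next chunk of raw lines
    if (length(input) == 0) break          # end of file
    buffer <- c(buffer, input)
    # write out complete records, each terminated by a GG!KK!KK! line
    repeat{
        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
        if (is.na(indx)) break             # no separator yet; read more
        writeLines(buffer[1:(indx - 1L)],
                   sprintf("newFile%04d.txt", fileNo))
        buffer <- buffer[-c(1:indx)]       # drop the written record and its separator
        fileNo <- fileNo + 1
    }
}
close(zz)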
-J
2011/10/18 jim holtman <jholtman at gmail.com>:
> Let's do it in two parts: first create all the separate files (and if
> this is what you are after, we can stop here). You can change the value
> on readLines to read in as many lines as you want; I set it to 100 just
> for testing.
>
> x <- textConnection("APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!
> APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!
> APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!")
>
> fileNo <- 1            # used for file name
> buffer <- NULL
> repeat{
>     input <- readLines(x, n = 100)
>     if (length(input) == 0) break       # done
>     buffer <- c(buffer, input)
>     # find separator
>     repeat{
>         indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>         if (is.na(indx)) break          # not found yet; read more
>         writeLines(buffer[1:(indx - 1L)],
>                    sprintf("newFile%04d", fileNo))
>         buffer <- buffer[-c(1:indx)]    # remove data
>         fileNo <- fileNo + 1
>     }
> }
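>
> Running this on the sample data above should leave you with three files,
> newFile0001 to newFile0003, each holding the seven APE lines of one record.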
>
>
> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesraja at gmail.com> wrote:
>> I have a data set like this in one .txt file (cols separated by !):
>>
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>>
>> It contains over 14 000 000 records. Because I run out of memory when
>> trying to handle this data in R, I'm trying to read it sequentially and
>> write it out as several .csv files (or .RData files), and then read
>> these into R one by one. One record in this data spans the lines
>> between two GG!KK!KK! separators. I tried to implement Jim Holtman's
>> approach (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html),
>> but the problem is how to avoid cutting a record in the middle: if I
>> put nrows = 1000000, I don't know whether one record (between two
>> GG!KK!KK! marks) ends up split across two files. How can I avoid that?
>> My code so far:
>>
>> zz <- file("myfile.txt", "r")
>> fileNo <- 1
>> repeat{
>>     gotError <- 1    # set to 2 if there is an error
>>     # catch the error if no more data
>>     tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>                                row.names=NULL, na.strings="", header=FALSE),
>>              error=function(x) gotError <<- 2)
>>     if (gotError == 2) break
>>     # save the intermediate data
>>     save(input, file=sprintf("file%03d.RData", fileNo))
>>     fileNo <- fileNo + 1
>> }
>> close(zz)
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
>