[R] extracting values from txt file that follow user-supplied quote
Rui Barradas
ruipbarradas at sapo.pt
Thu Jun 7 20:57:22 CEST 2012
Hello,
I've just read your follow-up question on regular expressions, and I
believe your original problem can be made much faster. Just use
readLines() differently, reading a large number of lines at a time.
For this to work you will still need to know the total number of lines
in the file.
fun <- function(con, pattern, nlines, n=5000L){
    # accept either a file name/URL or an already open connection
    if(is.character(con)){
        con <- file(con, open="rt")
        on.exit(close(con))
    }
    passes <- nlines %/% n       # number of full chunks of 'n' lines
    remaining <- nlines %% n     # lines left over after the full chunks
    res <- NULL
    for(i in seq_len(passes)){
        txt <- readLines(con, n=n)
        # keep matching lines, convert columns 70 to 78 to numeric
        res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
    }
    if(remaining){
        txt <- readLines(con, n=remaining)
        res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
    }
    res
}
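A possible variant, in case counting the lines first is itself too slow: keep reading chunks until readLines() returns nothing, so no 'nlines' is needed. This is only a sketch and was not timed on the full file. The name fun2 and the arguments 'first' and 'last' are made up for illustration, and fixed=TRUE assumes the key phrase is a literal string rather than a regular expression.

```r
# Hypothetical variant of fun(): reads until end of file, so the total
# number of lines does not have to be known in advance.
fun2 <- function(con, pattern, n = 5000L, first = 70L, last = 78L) {
    # accept either a file name/URL or an already open connection
    if (is.character(con)) {
        con <- file(con, open = "rt")
        on.exit(close(con))
    }
    res <- list()
    repeat {
        txt <- readLines(con, n = n)
        if (length(txt) == 0L) break    # nothing left to read
        # assumption: 'pattern' is a literal phrase, hence fixed = TRUE
        hits <- txt[grepl(pattern, txt, fixed = TRUE)]
        # collect per-chunk results in a list instead of growing a vector
        res[[length(res) + 1L]] <- as.numeric(substr(hits, first, last))
    }
    unlist(res)
}

# Small self-contained check on a temporary file whose values sit in
# columns 70 to 78, as in the original problem.
tmp <- tempfile()
writeLines(c("header",
             sprintf("%-69s%9.2f", " PERCENT DISCREPANCY =", 0.05),
             "filler",
             sprintf("%-69s%9.2f", " PERCENT DISCREPANCY =", -1.25)),
           tmp)
fun2(tmp, "PERCENT DISCREPANCY =", n = 2L)
```

Called as fun2(url, pat, 100000L), it should return the same values as fun() on the same file, but that is worth checking with all.equal() before relying on it.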
url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
pat <- "PERCENT DISCREPANCY ="
num_lines <- 14405247L
# your original
txt_con <- file(description=url, open="r")
pd <- NULL
t1 <- system.time(
    for(i in 1:num_lines){
        txt_line <- readLines(txt_con, n=1)
        if (length(grep(pat, txt_line))) {
            pd <- c(pd, as.numeric(substr(txt_line, 70, 78)))
        }
    }
)
close(txt_con)
# the function above, increased 'n'
t2 <- system.time(pd2 <- fun(url, pat, num_lines, 100000L))
all.equal(pd, pd2)
[1] TRUE
rbind(original=t1, fun=t2, ratio=t1/t2)
         user.self sys.self  elapsed user.child sys.child
original    780.16   196.16 981.9100         NA        NA
fun           0.10     0.04   3.2000         NA        NA
ratio      7801.60  4904.00 306.8469         NA        NA
That is a factor of about 300 in elapsed time.
Hope this helps,
Rui Barradas
On 06-06-2012 17:54, emorway wrote:
> useRs-
>
> I'm attempting to scan a more than 1Gb text file and read and store the
> values that follow a specific key-phrase that is repeated multiple times
> throughout the file. A snippet of the text file I'm trying to read is
> attached. The text file is a dumping ground for various aspects of the
> performance of the model that generates it. Thus, the location of
> information I'm wanting to extract from the file is not in a fixed position
> (i.e. it does not always appear in a predictable location, like line 1000,
> or 2000, etc.). Rather, the desired values always follow a specific phrase:
> " PERCENT DISCREPANCY ="
>
> One approach I took was the following:
>
> library(R.utils)
>
> txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
> #The above will need to be altered if one desires to test code on the
> #attached txt file, which will run much quicker
> system.time(num_lines<-countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
> #elapsed time on full 1Gb file took about 55 seconds on a 3.6GHz Xeon
> num_lines
> #14405247
>
> system.time(
>     for(i in 1:num_lines){
>         txt_line<-readLines(txt_con,n=1)
>         if (length(grep(" PERCENT DISCREPANCY =",txt_line))) {
>             pd<-c(pd,as.numeric(substr(txt_line,70,78)))
>         }
>     }
> )
> #Time took about 5 minutes
>
> The inefficiencies in this approach arise due to reading the file twice
> (first to get num_lines, then to step through each line looking for the
> desired text).
>
> Is there a way to speed this process up through the use of ?scan? I
> wasn't able to get anything working, but what I had in mind was to scan through
> the more than 1Gb file and when the keyphrase (e.g. " PERCENT
> DISCREPANCY = ") is encountered, read and store the next 13 characters
> (which will include some white spaces) as a numeric value, then resume the
> scan until the key phrase is encountered again and repeat until the
> end-of-the-file marker is encountered. Is such an approach even possible or
> is line-by-line the best bet?
>
> http://r.789695.n4.nabble.com/file/n4632558/MCR.out MCR.out