[R] Parsing
Paolo Sonego
paolo.sonego at gmail.com
Thu Jul 10 10:24:15 CEST 2008
Thank you Martin! This code is amazing! SO fast! Exactly what I was
looking for!
Parsing ~8M lines (~600 MB file size) took about 45s on a 3.4 GHz Xeon
(8 GB RAM).
Thank you so much!
Sincerely,
Paolo
Martin Morgan wrote:
> Paolo Sonego <paolo.sonego at gmail.com> writes:
>
>
>> I apologize for giving wrong information again ... :-[
>> The number of files is not a problem (30/40). The real issue is that
>> some of my files have ~10^6 lines (file size ~300-400 MB) :'(
>> Thanks again for your help and advice!
>>
>
> If memory is not an issue, then this might be reasonably performant...
>
> process_chunk <- function(txt, rec_sep, keys)
> {
>     ## filter: keep only record separators and lines starting with a key
>     keep_regex <- paste("^(",
>                         paste(rec_sep, keys, sep="|", collapse="|"),
>                         ")", sep="")
>     txt <- txt[grep(keep_regex, txt)]
>
>     ## construct key/value pairs: the first field on each line is the
>     ## key, the second the value
>     splt <- strsplit(txt, "\\W+")
>     val <- unlist(lapply(splt, "[", 2))
>     names(val) <- unlist(lapply(splt, "[", 1))
>
>     ## break key/value pairs into records at the separator lines
>     ends <- c(grep(rec_sep, txt), length(txt))
>     grps <- rep(seq_along(ends), c(ends[1], diff(ends)))
>     recs <- split(val, grps)
>
>     ## reformat as a matrix: one row per record, one column per key
>     sapply(keys, function(key, recs) {
>         res <- sapply(recs, "[", key)
>         names(res) <- NULL
>         res
>     }, recs=recs)
> }
>
>
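> Assuming the input has one whitespace-separated key/value pair per
> line, with each record ended by a separator line, a small sample like
> this (a guess at the format, chosen to reproduce the matrix below)
> can be written for testing:
>
> writeLines(c("x x_string", "y y_string", "z z_string", "w w_string",
>              "id1 id1_string", "id2 id2_string", "//",
>              "x x_string1", "y y_string1", "z z_string1",
>              "w w_string1", "//",
>              "x x_string2", "y y_string2", "z z_string2",
>              "w w_string2", "id1 id1_string1", "id2 id2_string1",
>              "//"), "/tmp/tmp.txt")
>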
>> rec <- "//"
>> keys <- c("x", "y", "z", "w", "id1", "id2")
>> process_chunk(readLines("/tmp/tmp.txt"), rec, keys)
>>
>      x            y            z            w            id1            id2
> [1,] "x_string"   "y_string"   "z_string"   "w_string"   "id1_string"   "id2_string"
> [2,] "x_string1"  "y_string1"  "z_string1"  "w_string1"  NA             NA
> [3,] "x_string2"  "y_string2"  "z_string2"  "w_string2"  "id1_string1"  "id2_string1"
>
> This took about 130s and no more than 250 MB of memory to process your
> data replicated to about 5M lines (~80 MB file size).
>
> I haven't really tested the following, but it might also be useful
> for processing the file in chunks:
>
> process <- function(filename,
>                     rec_sep="//",
>                     keys=c("x", "y", "z", "w", "id1", "id2"),
>                     chunk_size=10^6)
> {
>     result <- NULL
>     resid <- character(0)   # lines of a partial record carried forward
>     con <- file(filename, "r")
>     while(length(txt <- readLines(con, chunk_size)) != 0) {
>         recs <- grep(rec_sep, txt)
>         if (length(recs) > 0) {
>             ## hold back the lines after the last complete record
>             maxrec <- max(recs)
>             if (maxrec == length(txt)) buf <- character(0)
>             else buf <- txt[(maxrec+1):length(txt)]
>             txt <- c(resid, txt[-(maxrec:length(txt))])
>             resid <- buf
>         } else {
>             ## no record separator in this chunk: keep accumulating
>             ## rather than processing a partial record
>             resid <- c(resid, txt)
>             next
>         }
>         result <-
>             rbind(result,
>                   process_chunk(txt, rec_sep=rec_sep, keys=keys))
>     }
>     close(con)
>     if (length(resid) != 0) {   # flush a trailing record, if any
>         result <-
>             rbind(result,
>                   process_chunk(resid, rec_sep=rec_sep, keys=keys))
>     }
>     result
> }
>
>
>> process('/tmp/tmp.txt', chunk_size=10L)  # use a much larger chunk_size in practice
>>
>      x            y            z            w            id1            id2
> [1,] "x_string"   "y_string"   "z_string"   "w_string"   "id1_string"   "id2_string"
> [2,] "x_string1"  "y_string1"  "z_string1"  "w_string1"  NA             NA
> [3,] "x_string2"  "y_string2"  "z_string2"  "w_string2"  "id1_string1"  "id2_string1"
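>
> As a quick sanity check (using the small sample file sketched above),
> the chunked result should agree with the single-pass result:
>
> one_pass <- process_chunk(readLines("/tmp/tmp.txt"), "//",
>                           c("x", "y", "z", "w", "id1", "id2"))
> chunked <- process("/tmp/tmp.txt", chunk_size=10L)
> identical(one_pass, chunked)  # should be TRUE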
>
>
>
>
>> Regards,
>> Paolo
>>
>>
>> jim holtman wrote:
>>
>>> How much time is it taking on the files and how many files do you have
>>> to process? I tried it with your data duplicated so that I had 57K
>>> lines and it took 27 seconds to process. How much faster do you want?
>>>