[R] iterators : checkFunc with ireadLines

Wed Jun 3 17:45:36 CEST 2020

Laurent, Bill is suggesting that you build your own indexed database, but this has been done before, so reinventing the wheel seems inefficient and risky. It is impossible to build such an index without reading every line of the file at least once anyway, so you are better off looking at ways to process the entire file efficiently.

For example, you could load the data into an SQLite database in a couple of lines of code and then query it with SQL directly, through the sqldf data frame interface, or with dplyr.
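
A rough sketch of the SQLite route, not a definitive recipe. It assumes the DBI and RSQLite packages are installed. The temporary file, the table name sensor_data, and the chunk size are placeholders made up for illustration, and four short lines stand in for the large tab-separated file. The file is appended to the database a few lines at a time, so it is never held in memory all at once:

```r
library(DBI)

# Hypothetical stand-in for the large tab-separated file
# (sensor name in the first field, values in the remaining fields).
data_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999",
  "N023\t-0.031323\t-0.035026",
  "N053\t-0.014083\t-0.004741",
  "N163\t-0.054023\t-0.049345"
), data_file)

# File-backed SQLite database; the table name "sensor_data" is illustrative.
db <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

# Append the file chunk by chunk (2 lines here; use a much larger number
# in practice) so that only one chunk is in memory at a time.
infile <- file(data_file, "r")
repeat {
  chunk_lines <- readLines(infile, n = 2)
  if (length(chunk_lines) == 0) break
  chunk <- read.table(text = chunk_lines, sep = "\t",
                      colClasses = c("character", "numeric", "numeric"))
  dbWriteTable(db, "sensor_data", chunk, append = TRUE)
}
close(infile)

# Pull back only the sensors of interest with plain SQL.
selected <- dbGetQuery(
  db,
  "SELECT * FROM sensor_data WHERE V1 IN ('N053', 'N163') ORDER BY V1"
)
dbDisconnect(db)
print(selected)
```

Once the table exists, later sessions can reconnect to the same database file and run further queries without re-reading the text file.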

Or you could look at read_csv_chunked from the readr package.
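
A minimal sketch of the chunked approach, assuming readr is installed. Since the file in this thread is tab-separated, it uses read_tsv_chunked, the tab-delimited sibling of read_csv_chunked. The temporary file, the helper name keep_selected, and the chunk size are made up for illustration:

```r
library(readr)

# Hypothetical stand-in for the large tab-separated file.
data_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999",
  "N023\t-0.031323\t-0.035026",
  "N053\t-0.014083\t-0.004741",
  "N163\t-0.054023\t-0.049345"
), data_file)

sensors <- c("N053", "N163")

# Called once per chunk; keeps only the rows whose first field
# is one of the wanted sensors.
keep_selected <- function(chunk, pos) {
  chunk[chunk$X1 %in% sensors, ]
}

selected <- read_tsv_chunked(
  data_file,
  callback = DataFrameCallback$new(keep_selected),
  chunk_size = 2,     # tiny for the example; use thousands of rows in practice
  col_names = FALSE,  # columns are then named X1, X2, ...
  col_types = "cdd"   # one character column followed by two doubles
)
print(selected)
```

DataFrameCallback row-binds whatever the callback returns for each chunk, so only the selected rows accumulate in memory.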

On May 18, 2020 11:37:46 AM PDT, William Michels via R-help <r-help using r-project.org> wrote:
>
>Hi Laurent,
>
>Thank you for explaining your size limitations. Below is an example
>using the read.fwf() function to grab the first column of your input
>file (in 2000 row chunks). This column is converted to an index, and
>the index is used to create an iterator useful for skipping lines when
>reading input with scan(). (You could try processing your large file
>in successive 2000 line chunks, or whatever number of lines fits into
>memory). Maybe not as elegant as the approach you were going for, but
>read.fwf() should be pretty efficient:
>
>> sensors <-  c("N053", "N163")
>> read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
>    V1
>1 Time
>2 N023
>3 N053
>4 N123
>5 N163
>6 N193
>> first_col <- read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
>> which(first_col$V1 %in% sensors)
>[1] 3 5
>> index1 <- which(first_col$V1 %in% sensors)
>> iter_index1 <- iter(1:2000, checkFunc= function(n) {n %in% index1})
>> unlist(scan(file="test2.txt", what=list("","","","","","","","","",""), flush=TRUE, multi.line=FALSE, skip=nextElem(iter_index1)-1, nlines=1, quiet=TRUE))
> [1] "N053"      "-0.014083" "-0.004741" "0.001443"  "-0.010152" "-0.012996" "-0.005337" "-0.008738" "-0.015094" "-0.012104"
>> unlist(scan(file="test2.txt", what=list("","","","","","","","","",""), flush=TRUE, multi.line=FALSE, skip=nextElem(iter_index1)-1, nlines=1, quiet=TRUE))
> [1] "N163"      "-0.054023" "-0.049345" "-0.037158" "-0.04112"  "-0.044612" "-0.036953" "-0.036061" "-0.044516" "-0.046436"
>>
>
>(Note for this email and the previous one, I've deleted the first
>"hash" character from each line of your test file for clarity).
>
>HTH, Bill.
>
>W. Michels, Ph.D.
>
>
>
>
>
>On Mon, May 18, 2020 at 3:35 AM Laurent Rhelp <LaurentRHelp using free.fr> wrote:
>>
>> Dear William,
>> Thank you for your answer.
>> My file is very large, so I cannot read it into memory (I cannot use
>> read.table). So I want to put in memory only the lines I need to
>> process.
>> With readLines, as I did, it works, but I would like to use an
>> iterator and a foreach loop to understand this way of doing things,
>> because I thought that it was a better solution for writing nice code.
>>
>>
>> On 18/05/2020 at 04:54, William Michels wrote:
>> > Apologies, Laurent, for this two-part answer. I misunderstood your
>> > post where you stated you wanted to "filter(ing) some
>> > selected lines according to the line name... ." I thought that meant
>> > you had a separate index (like a series of primes) that you wanted to
>> > use to only read-in selected line numbers from a file (test file below
>> > with numbers 1:1000 each on a separate line):
>> >
>> >> library(gmp)
>> >> library(iterators)
>> >> iprime <- iter(1:100, checkFunc = function(n) isprime(n))
>> >> scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>> > Read 1 item
>> > [1] 2
>> >> scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>> > Read 1 item
>> > [1] 3
>> >> scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>> > Read 1 item
>> > [1] 5
>> >> scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>> > Read 1 item
>> > [1] 7
>> > However, what it really seems that you want to do is read each line of
>> > a (possibly enormous) file, test each line "string-wise" to keep or
>> > discard, and if you're keeping it, append the line to a list. I can
>> > certainly see the advantage of this strategy for reading in very, very
>> > large files, but it's not clear to me how the "ireadLines" function (in
>> > the "iterators" package) will help you, since it doesn't seem to
>> > generate anything but a sequential index.
>> >
>> > Anyway, below is an absolutely standard read-in of your data using
>> > read.table(). Hopefully some of the code I've posted has been useful
>> > to you.
>> >
>> >> sensors <-  c("N053", "N163")
>> >> read.table("test2.txt")
>> >     V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
>> > 1 Time  0.000000  0.000999  0.001999  0.002998  0.003998  0.004997  0.005997  0.006996  0.007996
>> > 2 N023 -0.031323 -0.035026 -0.029759 -0.024886 -0.024464 -0.026816 -0.033690 -0.041067 -0.038747
>> > 3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
>> > 4 N123 -0.019008 -0.013494 -0.013180 -0.029208 -0.032748 -0.020243 -0.015089 -0.014439 -0.011681
>> > 5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
>> > 6 N193 -0.022171 -0.022384 -0.022338 -0.023304 -0.022569 -0.021827 -0.021996 -0.021755 -0.021846
>> >> Laurent_data <- read.table("test2.txt")
>> >> Laurent_data[Laurent_data$V1 %in% sensors, ]
>> >     V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
>> > 3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
>> > 5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
>> >
>> > Best, Bill.
>> >
>> > W. Michels, Ph.D.
>> >
>> >
>> > On Sun, May 17, 2020 at 5:43 PM Laurent Rhelp <LaurentRHelp using free.fr> wrote:
>> >> Dear R-Help List,
>> >>
>> >> I would like to use an iterator to read a file, keeping only some
>> >> selected lines according to the line name, in order to use it afterwards
>> >> in a foreach loop. I wanted to use the checkFunc argument, as in the
>> >> following example found on the internet, which selects only prime numbers:
>> >>
>> >> iprime <- iter(1:100, checkFunc = function(n) isprime(n))
>> >>
>> >> (https://datawookie.netlify.app/blog/2013/11/iterators-in-r/)
>> >>
>> >> but the checkFunc argument does not seem to be available in the function
>> >> ireadLines (package iterators). So I wrote the code below to solve my
>> >> problem, but I am sure that I am missing something about how to use
>> >> iterators with files. Since I found nothing on the web about ireadLines
>> >> and the checkFunc argument, could somebody help me understand how to use
>> >> an iterator (and a foreach loop) on files, keeping only selected lines?
>> >>
>> >> Thank you very much
>> >> Laurent
>> >>
>> >> Presently here is my code:
>> >>
>> >> ##        mock file to read: test.txt
>> >> ##
>> >> # Time    0    0.000999    0.001999    0.002998    0.003998    0.004997    0.005997    0.006996    0.007996
>> >> # N023    -0.031323    -0.035026    -0.029759    -0.024886    -0.024464    -0.026816    -0.03369    -0.041067    -0.038747
>> >> # N053    -0.014083    -0.004741    0.001443    -0.010152    -0.012996    -0.005337    -0.008738    -0.015094    -0.012104
>> >> # N123    -0.019008    -0.013494    -0.01318    -0.029208    -0.032748    -0.020243    -0.015089    -0.014439    -0.011681
>> >> # N163    -0.054023    -0.049345    -0.037158    -0.04112    -0.044612    -0.036953    -0.036061    -0.044516    -0.046436
>> >> # N193    -0.022171    -0.022384    -0.022338    -0.023304    -0.022569    -0.021827    -0.021996    -0.021755    -0.021846
>> >>
>> >>
>> >> # sensors to keep
>> >>
>> >> sensors <-  c("N053", "N163")
>> >>
>> >>
>> >> library(iterators)
>> >>
>> >> library(rlist)
>> >>
>> >>
>> >> file_name <- "test.txt"
>> >>
>> >> con_obj <- file( file_name , "r")
>> >> ifile <- ireadLines( con_obj , n = 1 )
>> >>
>> >>
>> >> ## I do not do a loop for the example
>> >>
>> >> res <- list()
>> >>
>> >> r <- get_Lines_iter( ifile , sensors)
>> >> res <- list.append( res , r )
>> >> res
>> >> r <- get_Lines_iter( ifile , sensors)
>> >> res <- list.append( res , r )
>> >> res
>> >> r <- get_Lines_iter( ifile , sensors)
>> >> do.call("cbind",res)
>> >>
>> >> ## the function get_Lines_iter to select and process the line
>> >>
>> >> get_Lines_iter <- function(iter, sensors, sep = '\t', quiet = FALSE) {
>> >>   ## read the next record from the iterator
>> >>   r <- try(nextElem(iter), silent = TRUE)
>> >>   while (TRUE) {
>> >>     ## nextElem() signals an error once the iterator is exhausted
>> >>     if (inherits(r, "try-error")) {
>> >>       stop("The iterator is empty")
>> >>     }
>> >>     ## split the line that was read according to the separator
>> >>     r_txt <- textConnection(r)
>> >>     fields <- scan(file = r_txt, what = "character", sep = sep, quiet = quiet)
>> >>     close(r_txt)
>> >>     ## test whether we have to keep the line
>> >>     if (fields[1] %in% sensors) {
>> >>       ## data processing for the selected line (for the example,
>> >>       ## conversion to a data frame)
>> >>       n <- length(fields)
>> >>       x <- data.frame(as.numeric(fields[2:n]))
>> >>       names(x) <- fields[1]
>> >>       ## return the values
>> >>       print(paste0("Sensor ", fields[1], " ok"))
>> >>       return(x)
>> >>     } else {
>> >>       print(paste0("Sensor ", fields[1], " not selected"))
>> >>       r <- try(nextElem(iter), silent = TRUE)
>> >>     }
>> >>   } # end while loop
>> >> }
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> This email has been checked for viruses by Avast antivirus software.
>> >> https://www.avast.com/antivirus
>> >>
>> >>          [[alternative HTML version deleted]]
>> >>
>> >> ______________________________________________
>> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>>
>
