[R] iterators : checkFunc with ireadLines

Laurent Rhelp LaurentRHelp using free.fr
Tue May 19 09:07:38 CEST 2020


OK, thank you for the advice. I will take some time to look at these
packages in detail.


On 19/05/2020 at 05:44, Jeff Newmiller wrote:
> Laurent... Bill is suggesting building your own indexed database... but this has been done before, so re-inventing the wheel seems inefficient and risky. It is actually impossible to create such a beast without reading the entire file into memory at least temporarily anyway, so you are better off looking at ways to process the entire file efficiently.
>
> For example, you could load the data into a sqlite database in a couple of lines of code and use SQL directly or use the sqldf data frame interface, or use dplyr to query the database.
>
> Or you could look at read_csv_chunked from readr package.
>
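
A minimal sketch of the chunked idea in base R (a stand-in for readr's read_csv_chunked, not that function itself): the file is streamed through one connection in fixed-size chunks and only the lines whose first field matches are kept and parsed. The helper name filter_file_by_chunk, the chunk size and the whitespace-separated demo file are assumptions made for illustration.

```r
# Hypothetical helper: stream a file chunk by chunk and keep only the lines
# whose first whitespace-separated field is in `keep`.
filter_file_by_chunk <- function(path, keep, chunk_size = 2000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break                    # end of file reached
    first_field <- sub("[[:space:]].*$", "", lines)   # text before first blank
    kept <- c(kept, lines[first_field %in% keep])
  }
  if (length(kept) == 0L) return(NULL)                # nothing matched
  # Parse only the retained lines into a data frame
  read.table(text = kept, header = FALSE, stringsAsFactors = FALSE)
}

# Small demo file (assumed layout: sensor name, then values)
demo_file <- tempfile(fileext = ".txt")
writeLines(c("Time 0 0.000999 0.001999",
             "N023 -0.031323 -0.035026 -0.029759",
             "N053 -0.014083 -0.004741 0.001443",
             "N163 -0.054023 -0.049345 -0.037158"), demo_file)

selected <- filter_file_by_chunk(demo_file, c("N053", "N163"), chunk_size = 2L)
print(selected)
```

Only one chunk plus the retained lines are held in memory at a time, so the chunk size can be tuned to whatever fits.
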
> On May 18, 2020 11:37:46 AM PDT, William Michels via R-help <r-help using r-project.org> wrote:
>> Hi Laurent,
>>
>> Thank you for explaining your size limitations. Below is an example
>> using the read.fwf() function to grab the first column of your input
>> file (in 2000 row chunks). This column is converted to an index, and
>> the index is used to create an iterator useful for skipping lines when
>> reading input with scan(). (You could try processing your large file
>> in successive 2000 line chunks, or whatever number of lines fits into
>> memory). Maybe not as elegant as the approach you were going for, but
>> read.fwf() should be pretty efficient:
>>
>>> library(iterators)
>>> sensors <-  c("N053", "N163")
>>> read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
>>     V1
>> 1 Time
>> 2 N023
>> 3 N053
>> 4 N123
>> 5 N163
>> 6 N193
>>> first_col <- read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
>>> which(first_col$V1 %in% sensors)
>> [1] 3 5
>>> index1 <- which(first_col$V1 %in% sensors)
>>> iter_index1 <- iter(1:2000, checkFunc= function(n) {n %in% index1})
>>> unlist(scan(file="test2.txt", what=list("","","","","","","","","",""), flush=TRUE, multi.line=FALSE, skip=nextElem(iter_index1)-1, nlines=1, quiet=TRUE))
>> [1] "N053"      "-0.014083" "-0.004741" "0.001443"  "-0.010152" "-0.012996" "-0.005337" "-0.008738" "-0.015094" "-0.012104"
>>> unlist(scan(file="test2.txt", what=list("","","","","","","","","",""), flush=TRUE, multi.line=FALSE, skip=nextElem(iter_index1)-1, nlines=1, quiet=TRUE))
>> [1] "N163"      "-0.054023" "-0.049345" "-0.037158" "-0.04112" "-0.044612" "-0.036953" "-0.036061" "-0.044516" "-0.046436"
>>
>> (Note: for this email and the previous one, I've deleted the first
>> "hash" character from each line of your test file for clarity.)
>>
>> HTH, Bill.
>>
>> W. Michels, Ph.D.
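
The two-step idea above (index the first column, then read only the indexed lines) can be sketched in base R as follows. This is an illustration under stated assumptions, not Bill's exact method: the index is built with readLines() and substr() instead of read.fwf(), and the selected lines are read through a single open connection so the file is not reopened from the top for every line. The function names and the demo file are hypothetical.

```r
# Pass 1 (hypothetical helper): collect the line numbers whose first four
# characters match one of the sensor names, scanning the file in chunks.
index_matching_lines <- function(path, sensors, chunk_size = 2000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  hits <- integer(0)
  offset <- 0L                                   # lines read so far
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    hits <- c(hits, offset + which(substr(lines, 1, 4) %in% sensors))
    offset <- offset + length(lines)
  }
  hits
}

# Pass 2 (hypothetical helper): read only the indexed lines. `index` must be
# increasing; on an open connection, `skip` counts from the current position.
read_indexed_lines <- function(path, index) {
  con <- file(path, open = "r")
  on.exit(close(con))
  current <- 0L                                  # last line number consumed
  lapply(index, function(i) {
    fields <- scan(con, what = "", skip = i - current - 1L, nlines = 1L,
                   quiet = TRUE)
    current <<- i
    fields
  })
}

demo_file <- tempfile(fileext = ".txt")
writeLines(c("Time 0 0.000999",
             "N023 -0.031323 -0.035026",
             "N053 -0.014083 -0.004741",
             "N123 -0.019008 -0.013494",
             "N163 -0.054023 -0.049345",
             "N193 -0.022171 -0.022384"), demo_file)

idx  <- index_matching_lines(demo_file, c("N053", "N163"), chunk_size = 4L)
rows <- read_indexed_lines(demo_file, idx)
print(idx)
print(rows)
```

Passing a file name to scan() with skip, as in the transcript above, rereads the file from its first line on every call; keeping one connection open avoids that at the cost of requiring the index to be sorted.
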
>>
>>
>>
>>
>>
>> On Mon, May 18, 2020 at 3:35 AM Laurent Rhelp <LaurentRHelp using free.fr>
>> wrote:
>>> Dear William,
>>>    Thank you for your answer.
>>> My file is very large, so I cannot read it into memory (I cannot use
>>> read.table). I want to hold in memory only the lines I need to process.
>>> With readLines, as I did, it works, but I would like to use an iterator
>>> and a foreach loop to understand this approach, because I thought
>>> it was a better way to write clean code.
>>>
>>>
>>> On 18/05/2020 at 04:54, William Michels wrote:
>>>> Apologies, Laurent, for this two-part answer. I misunderstood your
>>>> post where you stated you wanted to "filter(ing) some
>>>> selected lines according to the line name... ." I thought that meant
>>>> you had a separate index (like a series of primes) that you wanted to
>>>> use to only read-in selected line numbers from a file (test file below
>>>> with numbers 1:1000 each on a separate line):
>>>>
>>>>> library(gmp)
>>>>> library(iterators)
>>>>> iprime <- iter(1:100, checkFunc = function(n) isprime(n))
>>>>> scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>> Read 1 item
>>>> [1] 2
>>>>> scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>> Read 1 item
>>>> [1] 3
>>>>> scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>> Read 1 item
>>>> [1] 5
>>>>> scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>> Read 1 item
>>>> [1] 7
>>>>
>>>> However, what it really seems that you want to do is read each line of
>>>> a (possibly enormous) file, test each line "string-wise" to keep or
>>>> discard, and if you're keeping it, append the line to a list. I can
>>>> certainly see the advantage of this strategy for reading in very, very
>>>> large files, but it's not clear to me how the "ireadLines" function (in
>>>> the "iterators" package) will help you, since it doesn't seem to
>>>> generate anything but a sequential index.
>>>>
>>>> Anyway, below is an absolutely standard read-in of your data using
>>>> read.table(). Hopefully some of the code I've posted has been useful
>>>> to you.
>>>>
>>>>> sensors <-  c("N053", "N163")
>>>>> read.table("test2.txt")
>>>>     V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
>>>> 1 Time  0.000000  0.000999  0.001999  0.002998  0.003998  0.004997  0.005997  0.006996  0.007996
>>>> 2 N023 -0.031323 -0.035026 -0.029759 -0.024886 -0.024464 -0.026816 -0.033690 -0.041067 -0.038747
>>>> 3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
>>>> 4 N123 -0.019008 -0.013494 -0.013180 -0.029208 -0.032748 -0.020243 -0.015089 -0.014439 -0.011681
>>>> 5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
>>>> 6 N193 -0.022171 -0.022384 -0.022338 -0.023304 -0.022569 -0.021827 -0.021996 -0.021755 -0.021846
>>>>> Laurent_data <- read.table("test2.txt")
>>>>> Laurent_data[Laurent_data$V1 %in% sensors, ]
>>>>     V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
>>>> 3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
>>>> 5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
>>>>
>>>> Best, Bill.
>>>>
>>>> W. Michels, Ph.D.
>>>>
>>>>
>>>> On Sun, May 17, 2020 at 5:43 PM Laurent Rhelp
>> <LaurentRHelp using free.fr> wrote:
>>>>> Dear R-Help List,
>>>>>
>>>>>       I would like to use an iterator to read a file, keeping only
>>>>> the lines selected by their line name, so that I can then use a
>>>>> foreach loop. I wanted to use the checkFunc argument, as in the
>>>>> following example found on the internet, which selects only prime
>>>>> numbers:
>>>>>
>>>>>     iprime <- iter(1:100, checkFunc = function(n) isprime(n))
>>>>>
>>>>> (https://datawookie.netlify.app/blog/2013/11/iterators-in-r/)
>>>>>
>>>>> but the checkFunc argument does not seem to be available with the
>>>>> function ireadLines (package iterators). So I wrote the code below to
>>>>> solve my problem, but I am sure that I am missing something about using
>>>>> iterators with files. Since I found nothing on the web about ireadLines
>>>>> and the checkFunc argument, could somebody help me understand how to
>>>>> use an iterator (and a foreach loop) on files, keeping only selected
>>>>> lines?
>>>>>
>>>>> Thank you very much
>>>>> Laurent
>>>>>
>>>>> Presently here is my code:
>>>>>
>>>>> ##        mock file to read: test.txt
>>>>> ##
>>>>> # Time    0    0.000999    0.001999    0.002998    0.003998    0.004997    0.005997    0.006996    0.007996
>>>>> # N023    -0.031323    -0.035026    -0.029759    -0.024886    -0.024464    -0.026816    -0.03369    -0.041067    -0.038747
>>>>> # N053    -0.014083    -0.004741    0.001443    -0.010152    -0.012996    -0.005337    -0.008738    -0.015094    -0.012104
>>>>> # N123    -0.019008    -0.013494    -0.01318    -0.029208    -0.032748    -0.020243    -0.015089    -0.014439    -0.011681
>>>>> # N163    -0.054023    -0.049345    -0.037158    -0.04112    -0.044612    -0.036953    -0.036061    -0.044516    -0.046436
>>>>> # N193    -0.022171    -0.022384    -0.022338    -0.023304    -0.022569    -0.021827    -0.021996    -0.021755    -0.021846
>>>>>
>>>>>
>>>>> # sensors to keep
>>>>>
>>>>> sensors <-  c("N053", "N163")
>>>>>
>>>>> library(iterators)
>>>>> library(rlist)
>>>>>
>>>>> ## the function get_Lines_iter to select and process the line
>>>>> ## (defined first so that it exists before it is called)
>>>>>
>>>>> get_Lines_iter <- function(iter, sensors, sep = '\t', quiet = FALSE) {
>>>>>   repeat {
>>>>>     ## read the next record in the iterator
>>>>>     r <- try(nextElem(iter), silent = TRUE)
>>>>>     if (inherits(r, "try-error")) {
>>>>>       stop("The iterator is empty")
>>>>>     }
>>>>>     ## split the read line according to the separator
>>>>>     fields <- scan(text = r, what = "character", sep = sep, quiet = quiet)
>>>>>     ## test if we have to keep the line
>>>>>     if (fields[1] %in% sensors) {
>>>>>       ## data processing for the selected line (for the example,
>>>>>       ## transformation into a data frame)
>>>>>       x <- data.frame(as.numeric(fields[-1]))
>>>>>       names(x) <- fields[1]
>>>>>       ## We return the values
>>>>>       print(paste0("sensor ", fields[1], " ok"))
>>>>>       return(x)
>>>>>     }
>>>>>     print(paste0("Sensor ", fields[1], " not selected"))
>>>>>   }
>>>>> }
>>>>>
>>>>> file_name <- "test.txt"
>>>>>
>>>>> con_obj <- file( file_name , "r")
>>>>> ifile <- ireadLines( con_obj , n = 1 )
>>>>>
>>>>> ## I do not do a loop for the example
>>>>>
>>>>> res <- list()
>>>>>
>>>>> r <- get_Lines_iter( ifile , sensors)
>>>>> res <- list.append( res , r )
>>>>> res
>>>>> r <- get_Lines_iter( ifile , sensors)
>>>>> res <- list.append( res , r )
>>>>> res
>>>>> ## a third call would reach the end of the mock file and stop with
>>>>> ## "The iterator is empty", so the results are combined here
>>>>> do.call("cbind",res)
>>>>> close(con_obj)
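
Since ireadLines() has no checkFunc argument, one possible workaround is a small closure that wraps an open connection and hands back only the lines that pass a test. The sketch below uses base R only; make_filtered_reader, is_wanted and the tab-separated demo file are hypothetical names and data chosen for illustration, and the reader returns NULL at end of file instead of signalling StopIteration as the iterators package does.

```r
# Hypothetical helper: returns a function that yields the next line accepted
# by `check_func`, or NULL once the connection is exhausted.
make_filtered_reader <- function(con, check_func) {
  function() {
    repeat {
      line <- readLines(con, n = 1L)
      if (length(line) == 0L) return(NULL)        # no more lines
      if (isTRUE(check_func(line))) return(line)  # keep this one
    }
  }
}

sensors <- c("N053", "N163")

# Keep a line when its first tab-separated field is a wanted sensor
is_wanted <- function(line) {
  strsplit(line, "\t", fixed = TRUE)[[1]][1] %in% sensors
}

demo_file <- tempfile(fileext = ".txt")
writeLines(c("Time\t0\t0.000999",
             "N023\t-0.031323\t-0.035026",
             "N053\t-0.014083\t-0.004741",
             "N163\t-0.054023\t-0.049345",
             "N193\t-0.022171\t-0.022384"), demo_file)

con <- file(demo_file, open = "r")
next_line <- make_filtered_reader(con, is_wanted)

res <- list()
while (!is.null(line <- next_line())) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  res[[fields[1]]] <- as.numeric(fields[-1])      # one numeric vector per sensor
}
close(con)

result <- do.call(cbind, res)
print(result)
```

Only one line is held in memory at a time, and the loop ends cleanly at end of file without needing try().
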
>>>>>
>>>>> --
>>>>> This email has been checked for viruses by Avast antivirus software.
>>>>> https://www.avast.com/antivirus
>>>>>
>>>>>           [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.


