[R] Accelerating binRead

Philippe de Rochambeau phiroc at free.fr
Sun Sep 18 17:02:00 CEST 2016


I would gladly examine your example, Mike.
Cheers,
Philippe

> On 18 Sept 2016, at 16:05, Michael Sumner <mdsumner at gmail.com> wrote:
> 
> 
> 
>> On Sun, 18 Sep 2016, 19:04 Philippe de Rochambeau <phiroc at free.fr> wrote:
>> Please find below code that attempts to read ints, longs and floats from a binary file (which is a simplification of my original program).
>> Please disregard the R inefficiencies, such as using rbind, for now.
>> I’ve also included Java code to generate the binary file.
>> The output shows that, at one point, anInt becomes undefined. Unfortunately, I couldn’t find the correct R function to determine whether anInt is undefined or not, as is.null, is.nan, and is.infinite don’t work.
>> Any help would be much appreciated.
>> Many thanks in advance.
>> Philippe
>> 
>> ———————
>> [1] "anInt = 1"
>> [1] "is.null  FALSE"
>> [1] "is.nan  FALSE"
>> [1] "is.infinite  FALSE"
>> [1] "aLong = 2"
>> [1] "aFloat = 3.44440007209778"
>> [1] "--------------------------"
>> [1] "anInt = 2"
>> [1] "is.null  FALSE"
>> [1] "is.nan  FALSE"
>> [1] "is.infinite  FALSE"
>> [1] "aLong = 22"
>> [1] "aFloat = 13.4644002914429"
>> [1] "--------------------------"
>> [1] "anInt = 3"
>> [1] "is.null  FALSE"
>> [1] "is.nan  FALSE"
>> [1] "is.infinite  FALSE"
>> [1] "aLong = 55"
>> [1] "aFloat = 45.4444007873535"
>> [1] "--------------------------"
>> [1] "anInt = "
>> [1] "is.null  FALSE"
>> [1] "is.nan  "
>> [1] "is.infinite  "
>> [1] "aLong = "
>> [1] "aFloat = "
>> [1] "--------------------------"
>>      [,1]      [,2]      [,3]
>> [1,] 1         2         3.4444
>> [2,] 2         22        13.4644
>> [3,] 3         55        45.4444
>> [4,] Integer,0 Integer,0 Numeric,0
>> >
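
The `Integer,0` entries in the last row of the matrix above suggest that readBin() returned a zero-length vector once the file was exhausted. A zero-length vector is neither NULL nor NaN nor infinite, which would explain why the three tests printed nothing useful; checking length(anInt) == 0 is the usual way to detect it. A minimal sketch, using a temporary file and made-up values rather than data0.bin:

```r
# Write three big-endian 4-byte integers to a temporary file (illustrative data).
path <- tempfile(fileext = ".bin")
writeBin(1:3, path, size = 4, endian = "big")

con <- file(path, "rb")
values <- integer(0)
repeat {
  anInt <- readBin(con, what = integer(), size = 4, n = 1, endian = "big")
  # At end of file readBin() returns integer(0): a vector of length zero.
  if (length(anInt) == 0) break
  values <- c(values, anInt)
}
close(con)
print(values)
```

The same length test can replace the fixed `max` counter in a loop such as the one in readFile, so that the loop stops when the data runs out.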
>> 
>> -----------
>> 
>> 
>> —————————————————————
>> 
>> readFile <- function(inputPath) {
>>   URL <- file(inputPath, "rb")
>>   PLT <- matrix(nrow=0, ncol=3)
>>   counte <- 0
>>   max <- 4
>>   while (counte < max) {
>>     anInt <- readBin(con=URL, what=integer(), size=4, n=1, endian="big")
>>     print(paste("anInt =", anInt))
>>     #if (! (anInt == 0)) { print(paste("empty int")); break }
>>     print(paste("is.null ", is.null(anInt)))
>>     print(paste("is.nan ", is.nan(anInt)))
>>     print(paste("is.infinite ", is.infinite(anInt)))
>>     aLong <- readBin(URL, integer(), size=8, n=1, endian="big")
>>     print(paste("aLong =", aLong))
>>     aFloat <- readBin(URL, numeric(), size=4, n=1, endian="big")
>>     print(paste("aFloat =", aFloat))
>>     print("--------------------------")
>>     PLT <- rbind(PLT, list(anInt, aLong, aFloat))
>>     counte <- counte + 1
>>   } # end while
>>   close(URL)
>>   PLT
>> }
>> fichier <- "/Users/philippe/Desktop/datatests/data0.bin"
>> PLT2 <- readFile(fichier)
>> print(PLT2)
>> —————————————————————
>> 
>> import java.io.*;
>> 
>> public class Main {
>> 
>>         Main() {
>>                 writeData();
>>         }
>> 
>>         public static void main(String[] args) {
>>                 new Main();
>>         }
>> 
>>         public void writeData() {
>> 
>>                 final String path = "/Users/philippe/Desktop/datatests/data0.bin";
>> 
>>                 DataOutputStream dos;
>>                 try {
>>                         dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path)));
>>                         // big endian write! ("high byte first") , see https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html
>>                         dos.writeInt(1);
>>                         dos.writeLong(2L);
>>                         dos.writeFloat(3.4444F);
>> 
>>                         dos.writeInt(2);
>>                         dos.writeLong(22L);
>>                         dos.writeFloat(13.4644F);
>> 
>>                         dos.writeInt(3);
>>                         dos.writeLong(55L);
>>                         dos.writeFloat(45.4444F);
>> 
>>                         dos.close();
>>                 } catch (FileNotFoundException e) {
>>                         e.printStackTrace();
>>                 } catch (IOException ioe) {
>>                         ioe.printStackTrace();
>>                 }
>> 
>>         }
>> 
>> }
>> 
>> 
>> —————————————————————
>> 
>> 
>> 
>> 
>> 
>> 
>> > On 17 Sept 2016, at 20:45, Philippe de Rochambeau <phiroc at free.fr> wrote:
>> >
>> > Hi Jim,
>> > this is exactly the answer I was looking for. Many thanks. I didn’t know R had a pack function, as in Perl.
>> > To answer your earlier question, I am trying to update legacy code to read a binary file of unknown size over a network, slice it up into rows each containing an integer, an integer, a long, a short, a float and a float, and stuff the rows into a matrix.
> 
> 
> 
> It's possible to read all rows fast as raw(), then parse in a vectorised way with matrix indexing to group the bytes appropriately. There is an example on the mailing list somewhere, but otherwise I can show an example if that's of interest.  
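
One way the approach described above might look, as a sketch only and not necessarily the example Mike has in mind: read every byte in a single call, reshape the bytes into a matrix with one column per record, then decode each field from its block of rows. The 16-byte record layout (int, long, float, big-endian) is taken from the Java writer quoted earlier; the helper `make_record` and the temporary file are assumptions made purely so the example is self-contained.

```r
rec_size <- 16L  # 4 (int) + 8 (long) + 4 (float)

# Build a small test file with that layout (illustrative data only).
make_record <- function(i, l, f) {
  c(writeBin(as.integer(i), raw(), size = 4, endian = "big"),
    writeBin(0L,            raw(), size = 4, endian = "big"),  # high word of the long
    writeBin(as.integer(l), raw(), size = 4, endian = "big"),  # low word of the long
    writeBin(f,             raw(), size = 4, endian = "big"))
}
path <- tempfile(fileext = ".bin")
writeBin(c(make_record(1, 2, 3.4444),
           make_record(2, 22, 13.4644),
           make_record(3, 55, 45.4444)), path)

# Read the whole file at once, then reshape: one column per record.
bytes <- readBin(path, what = "raw", n = file.size(path))
n_rec <- length(bytes) %/% rec_size
m <- matrix(bytes[seq_len(n_rec * rec_size)], nrow = rec_size)

# Each field is a block of rows; as.vector() lays the bytes out record by record.
an_int  <- readBin(as.vector(m[1:4, ]),   integer(), n = n_rec, size = 4, endian = "big")
hi      <- readBin(as.vector(m[5:8, ]),   integer(), n = n_rec, size = 4, endian = "big")
lo      <- readBin(as.vector(m[9:12, ]),  integer(), n = n_rec, size = 4, endian = "big")
a_float <- readBin(as.vector(m[13:16, ]), numeric(), n = n_rec, size = 4, endian = "big")

# Combine the two 32-bit halves as doubles (exact up to 2^53);
# a negative low word means its top bit was set, so add 2^32.
a_long <- hi * 2^32 + ifelse(lo < 0, lo + 2^32, lo)

result <- cbind(anInt = an_int, aLong = a_long, aFloat = a_float)
print(result)
```

Only four readBin() calls are made regardless of the number of records, and no matrix is grown row by row. One caveat: a low word of exactly 0x80000000 comes back as NA, because that bit pattern is R's NA_integer_.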
> 
> 
> Cheers, Mike
> 
> 
>> > Best regards,
>> > Philippe
>> >
>> >> On 17 Sept 2016, at 20:38, jim holtman <jholtman at gmail.com> wrote:
>> >>
>> >> Here is an example of how to do it:
>> >>
>> >> x <- 1:10  # integer values
>> >> xf <- seq(1.0, 2, by = 0.1)  # floating point
>> >>
>> >> setwd("d:/temp")
>> >>
>> >> # create file to write to
>> >> output <- file('integer.bin', 'wb')
>> >> writeBin(x, output)  # write integer
>> >> writeBin(xf, output)  # write reals
>> >> close(output)
>> >>
>> >>
>> >> library(pack)
>> >> library(readr)
>> >>
>> >> # read all the data at once
>> >> allbin <- read_file_raw('integer.bin')
>> >>
>> >> # decode the data into a list
>> >> (result <- unpack("V V V V V V V V V V d d d d d d d d d d", allbin))
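
For comparison, the same file can probably be decoded with base R alone, since readBin() also accepts a raw vector in place of a connection. The sketch below writes to a temporary file instead of d:/temp (an assumption, for portability) and relies on the platform's default byte order for both writing and reading. Note that seq(1.0, 2, by = 0.1) yields 11 values, one more than the ten `d` fields in the template above.

```r
x  <- 1:10                     # integers, as in the example above
xf <- seq(1.0, 2, by = 0.1)    # 11 doubles

# Same layout as above: all integers first, then all doubles.
path <- tempfile(fileext = ".bin")
output <- file(path, "wb")
writeBin(x, output)
writeBin(xf, output)
close(output)

# Read every byte once, then decode the two blocks from the raw vector.
allbin <- readBin(path, what = "raw", n = file.size(path))

n_int <- length(x)
ints  <- readBin(allbin[seq_len(n_int * 4)],  integer(), n = n_int,      size = 4)
reals <- readBin(allbin[-seq_len(n_int * 4)], double(),  n = length(xf), size = 8)

print(ints)
print(reals)
```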
>> >>
>> >>
>> >>
>> >>
>> >> Jim Holtman
>> >> Data Munger Guru
>> >>
>> >> What is the problem that you are trying to solve?
>> >> Tell me what you want to do, not how you want to do it.
>> >>
>> >> On Sat, Sep 17, 2016 at 11:04 AM, Ismail SEZEN <sezenismail at gmail.com> wrote:
>> >> I noticed the same issue but didn't care much :)
>> >>
>> >> On Sat, Sep 17, 2016, 18:01 jim holtman <jholtman at gmail.com> wrote:
>> >> Your example was not reproducible.  Also how do you "break" out of the
>> >> "while" loop?
>> >>
>> >>
>> >> Jim Holtman
>> >> Data Munger Guru
>> >>
>> >> What is the problem that you are trying to solve?
>> >> Tell me what you want to do, not how you want to do it.
>> >>
>> >> On Sat, Sep 17, 2016 at 8:05 AM, Philippe de Rochambeau <phiroc at free.fr>
>> >> wrote:
>> >>
>> >>> Hello,
>> >>> the following function, which stores numeric values extracted from a
>> >>> binary file, into an R matrix, is very slow, especially when the said file
>> >>> is several MB in size.
>> >>> Should I rewrite the function in inline C or in C/C++ using Rcpp? If the
>> >>> latter case is true, how do you « readBin »  in Rcpp (I’m a total Rcpp
>> >>> newbie)?
>> >>> Many thanks.
>> >>> Best regards,
>> >>> phiroc
>> >>>
>> >>>
>> >>> -------------
>> >>>
>> >>> # inputPath is something like http://myintranet/getData?pathToFile=/usr/lib/xxx/yyy/data.bin
>> >>>
>> >>> PLTreader <- function(inputPath){
>> >>>        URL <- file(inputPath, "rb")
>> >>>        PLT <- matrix(nrow=0, ncol=6)
>> >>>        compteurDePrints = 0
>> >>>        compteurDeLignes <- 0
>> >>>        maxiPrints = 5
>> >>>        displayData <- FALSE
>> >>>        while (TRUE) {
>> >>>                periodIndex <- readBin(URL, integer(), size=4, n=1,
>> >>> endian="little") # int (4 bytes)
>> >>>                eventId <- readBin(URL, integer(), size=4, n=1,
>> >>> endian="little") # int (4 bytes)
>> >>>                dword1 <- readBin(URL, integer(), size=4, signed=FALSE,
>> >>> n=1, endian="little") # int
>> >>>                dword2 <- readBin(URL, integer(), size=4, signed=FALSE,
>> >>> n=1, endian="little") # int
>> >>>                if (dword1 < 0) {
>> >>>                        # signed -> unsigned 32-bit: add 2^32 (not 2^32-1)
>> >>>                        dword1 = dword1 + 2^32;
>> >>>                }
>> >>>                eventDate = (dword2*2^32 + dword1)/1000
>> >>>                repNum <- readBin(URL, integer(), size=2, n=1,
>> >>> endian="little") # short (2 bytes)
>> >>>                exp <- readBin(URL, numeric(), size=4, n=1,
>> >>> endian="little") # float (4 bytes, strangely enough, would expect 8)
>> >>>                loss <- readBin(URL, numeric(), size=4, n=1,
>> >>> endian="little") # float (4 bytes)
>> >>>                PLT <- rbind(PLT, c(periodIndex, eventId, eventDate,
>> >>> repNum, exp, loss))
>> >>>        } # end while
>> >>>        close(URL)  # close first: code placed after return() never runs
>> >>>        return(PLT)
>> >>> }
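
A note on the dword arithmetic in the function above, offered as a sketch: according to ?readBin, the `signed` argument is only used for integers of sizes 1 and 2, so a 4-byte read always comes back signed and has to be corrected by hand. The correction is to add 2^32 to negative values. The helper name `to_unsigned32` and the eight example bytes are made up for illustration.

```r
# Map a signed 32-bit value (as returned by readBin with size = 4)
# to its unsigned interpretation, held in a double.
to_unsigned32 <- function(x) ifelse(x < 0, x + 2^32, x)

# Illustrative little-endian 64-bit integer, low dword first:
# low dword = 0xFFFFFFFF, high dword = 5.
bytes  <- as.raw(c(0xff, 0xff, 0xff, 0xff, 0x05, 0x00, 0x00, 0x00))
dwords <- readBin(bytes, integer(), n = 2, size = 4, endian = "little")
dword1 <- dwords[1]  # low dword, read as a signed integer
dword2 <- dwords[2]  # high dword

# Same combination as eventDate above, before the division by 1000.
value <- dword2 * 2^32 + to_unsigned32(dword1)
print(format(value, scientific = FALSE))
```

Doubles represent integers exactly up to 2^53, so this works for millisecond timestamps; adding 2^32 - 1 instead would leave every corrected value one too small.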
>> >>>
>> >>> ----------------
>> >>>        [[alternative HTML version deleted]]
>> >>>
>> >>> ______________________________________________
>> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >>> and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >
>> >
>> 
>> 
> 
> -- 
> Dr. Michael Sumner
> Software and Database Engineer
> Australian Antarctic Division
> 203 Channel Highway
> Kingston Tasmania 7050 Australia
> 
