[R] reading VERY large binary files

Duncan Murdoch murdoch at stats.uwo.ca
Wed Nov 8 01:45:42 CET 2006


On 11/7/2006 4:54 PM, Matt Anthony wrote:
> Hello,
>
> I am trying to read in elements out of a very large binary file ... the
> total file is 4 gigs. I want to select rows out of the file, and the
> current procedure I run works but is prohibitively slow (takes more than
> a day to run and still won't complete). Is there any faster way to
> accomplish this?

You are doing several things that are likely to be slow.
>
> My current procedure looks like this:
>
> readHH <- function(file_name, hhid_list) {
> incon=file(file_name, open="rb")
> result=data.frame()
> tran=list()
> byte_mark=0
> last_1M_mod=0
> file_size=file.info(file_name)$size
> write.table(paste("Data pulled from", file_name, sep=" "),
> file="readHH_output.txt", sep=",", row.names=FALSE, col.names=FALSE,
> append=TRUE)
> while (TRUE) {
>     tran$hh_id <- readBin(incon,integer(),1,size=4)

Why use a function call integer() here, rather than just the character 
string "integer"?
> 
>     if(is.element(tran$hh_id, hhid_list)) {

is.element() is a base R function (equivalent to %in%), but since it's 
going to be called once per record, it might be a place for an optimization.
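One way to cut the per-call overhead is to test many ids in a single vectorised call instead of one record at a time. A minimal sketch, assuming the ids for a chunk of records have already been read into a vector; the values and the names ids_in_chunk, keep and selected are made up for illustration:

```r
# Hypothetical ids, for illustration only.
hhid_list    <- c(1001L, 1005L, 1010L)
ids_in_chunk <- c(1000L, 1001L, 1002L, 1005L, 1005L, 1011L)

# One membership test for the whole chunk rather than one call per record.
keep     <- ids_in_chunk %in% hhid_list
selected <- ids_in_chunk[keep]
print(selected)
```

Whether this helps in practice depends on how the ids are read; it only pays off if several records are processed per call.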

>         tran$prov_id <- readBin(incon,integer(),1,size=2)
>         tran$txn_dn <- readBin(incon,integer(),1,size=2)
>         tran$total_dollars <- readBin(incon,integer(),1,size=4)
>         tran$total_items <- readBin(incon,integer(),1,size=4)
>         tran$order_id <- readBin(incon,integer(),1,size=4)
>         tran$txn_type <- readChar(incon,1)
>         tran$gender <- readChar(incon,1)
>         tran$zip_code <- readChar(incon,5)
>         tran$region_code <- readChar(incon,1)
>         tran$county_code <- readChar(incon,1)
>         tran$state_abbrev <- readChar(incon,2)
>         tran$channel_code <- readChar(incon,1)
>         tran$source_code <- readChar(incon,20)
>         tran$payment_type <- readChar(incon,1)
>         tran$credit_card <- readChar(incon,1)
>         tran$promo_type <- readChar(incon,1)
>         tran$flags <- readChar(incon,1)

You could probably make all of this a lot faster by combining it into 
three calls:

ints2 <- readBin(incon, "integer", 2, size=2)
ints4 <- readBin(incon, "integer", 3, size=4)
chars <- readChar(incon, 36)

and then extracting the elements after reading.  The extraction will 
probably be pretty fast, especially if you put the results into matrices 
rather than data.frames.  data.frames are hugely slower than matrices.
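The three-call idea can be tried on a small test file. This is only a sketch under stated assumptions: the record layout is taken from the field order in the quoted code, the values and the temporary file are invented, and the character offsets passed to substr() follow from that assumed layout. Note that the fields read in the quoted code add up to 56 bytes (4 + 2*2 + 3*4 + 36), whereas the skip branch advances by 42, so the real record length should be checked against the file's specification.

```r
# Build one 56-byte record with made-up values, following the field
# order in the quoted code, and write two copies to a temporary file.
chars36 <- paste0("P", "M", "80021", "R", "C", "CO", "W",
                  sprintf("SRC%017d", 1L), "V", "Y", "N", "F")
tmp <- tempfile(fileext = ".bin")
out <- file(tmp, open = "wb")
for (id in c(7L, 9L)) {
  writeBin(id, out, size = 4)                  # hh_id
  writeBin(c(1L, 2L), out, size = 2)           # prov_id, txn_dn
  writeBin(c(100L, 3L, 555L), out, size = 4)   # total_dollars, total_items, order_id
  writeChar(chars36, out, eos = NULL)          # 36 bytes of character fields
}
close(out)

# Read the first record: the id, then three combined calls.
incon <- file(tmp, open = "rb")
hh_id <- readBin(incon, "integer", 1, size = 4)
ints2 <- readBin(incon, "integer", 2, size = 2)
ints4 <- readBin(incon, "integer", 3, size = 4)
chars <- readChar(incon, 36, useBytes = TRUE)
close(incon)

# Extract individual fields after reading.
prov_id      <- ints2[1]
order_id     <- ints4[3]
zip_code     <- substr(chars, 3, 7)
state_abbrev <- substr(chars, 10, 11)
```

Reading the character fields as one string and slicing it with substr() replaces twelve readChar() calls with one.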

> 
>         write.table(data.frame(tran), file="readHH_output", sep=",",
> row.names=FALSE, col.names=FALSE, append=TRUE)

This is going to reopen, seek, and close the file each time.  Do you 
really need to do that?  Can't you open the output file once, then just 
write the data to it?
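A minimal sketch of opening the output once, assuming a text connection is acceptable for the output; the path, the loop and the record contents are placeholders, not the original data:

```r
out_path <- tempfile(fileext = ".txt")   # placeholder output path
outcon   <- file(out_path, open = "w")   # opened once, before the loop

writeLines("Data pulled from example.bin", outcon)
for (i in 1:3) {                         # stand-in for the read loop
  rec <- data.frame(hh_id = i, total_dollars = i * 10L)
  # Writing to an already-open connection continues where the last write ended.
  write.table(rec, file = outcon, sep = ",",
              row.names = FALSE, col.names = FALSE)
}
close(outcon)                            # closed once, after the loop

lines_written <- readLines(out_path)
```

Because the connection stays open, no append=TRUE is needed and the file is not reopened for each record.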

> 
>         result <- rbind(result,data.frame(tran))

This is also very slow.  Each rbind() has to grow a big list of vectors 
(which is how result is stored) every time you read a record.  It would be 
faster if you could pre-allocate the result, and just assign values into 
it, especially if you were assigning into a matrix, not a data.frame.
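Pre-allocation could look roughly like this. It is a sketch only: n_max is a hypothetical upper bound on the number of matching records, just three integer columns are kept, and the loop stands in for the real read loop.

```r
n_max  <- 5L   # hypothetical upper bound on matching records
result <- matrix(NA_integer_, nrow = n_max, ncol = 3,
                 dimnames = list(NULL,
                                 c("hh_id", "total_dollars", "total_items")))
n_used <- 0L

for (id in c(11L, 12L, 13L)) {           # stand-in for the read loop
  n_used <- n_used + 1L
  result[n_used, ] <- c(id, id * 2L, 1L) # assign into the existing row
}

# Drop the unused rows at the end.
result <- result[seq_len(n_used), , drop = FALSE]
```

The character fields would need a second, character matrix (or a conversion to a data.frame once, after the loop), since a matrix holds a single type.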

I don't know which of these suggestions will have the biggest effect. 
I'd suggest trying them one by one until things are fast enough, and 
then going on to something else.

Duncan Murdoch

>         }
>     else {
>         byte_mark=byte_mark+42
>         if (byte_mark>=file_size) {break}
>         else {seek(incon, where=byte_mark)}
>         }
>     }
> return(result)
> }
>
> Thanks
>
> Matt