[R-pkg-devel] Fast Matrix Serialization in R?

Simon Urbanek simon.urbanek at R-project.org
Thu May 9 07:26:45 CEST 2024


Sameh,

if it's a matrix, that's easy: you can write it directly, which is the fastest possible way without compression. A quick proof of concept:

n <- 20000^2
A <- matrix(runif(n), ncol = sqrt(n))

## write (dim + payload)
con <- file(description = "matrix_file", open = "wb")
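system.time({
  writeBin(d <- dim(A), con)  # header: the two integer dimensions
  dim(A) <- NULL              # drop dims so writeBin() sees a plain vector
  writeBin(A, con)            # payload: the doubles, in column-major order
  dim(A) <- d                 # restore the matrix shape
})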
close(con)

## read
con <- file(description = "matrix_file", open = "rb")
system.time({
  d <- readBin(con, 1L, 2)             # header: two integers (nrow, ncol)
  A1 <- readBin(con, 1, d[1] * d[2])   # payload: nrow * ncol doubles
  dim(A1) <- d                         # reattach the matrix shape
})
close(con)
identical(A, A1)

   user  system elapsed 
  0.931   2.713   3.644 
   user  system elapsed 
  0.089   1.360   1.451 
[1] TRUE

So it's really just limited by the speed of your disk; parallelization won't help here.
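
For reuse, the proof of concept above could be wrapped into a pair of helpers. This is only a minimal sketch: the names write_matrix_bin() and read_matrix_bin() are hypothetical (not base R), the file layout is assumed to be exactly the one used above (two 4-byte integers followed by the doubles), and it handles double matrices only. Note that as.vector() may copy the data, which the dim(A) <- NULL trick above avoids.

```r
## Hypothetical helpers, not part of base R.
## Assumed layout: nrow and ncol as 4-byte integers, then the doubles
## in column-major order, native endianness.
write_matrix_bin <- function(x, path) {
  stopifnot(is.matrix(x), is.double(x))
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(dim(x), con)        # header
  writeBin(as.vector(x), con)  # payload (as.vector() may copy)
  invisible(path)
}

read_matrix_bin <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  d <- readBin(con, what = integer(), n = 2L)
  ## as.numeric() avoids integer overflow of nrow * ncol for large matrices
  x <- readBin(con, what = double(), n = prod(as.numeric(d)))
  dim(x) <- d
  x
}

## Small round trip
m <- matrix(runif(12), nrow = 3)
f <- tempfile()
write_matrix_bin(m, f)
identical(m, read_matrix_bin(f))
file.size(f)  # 2 * 4 + 12 * 8 = 104 bytes
```

With the same layout, the 20000 x 20000 matrix comes to 8 + 8 * 20000^2 bytes, i.e. about 3.2 GB on disk, which is why the timings track disk throughput.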

Note that in general you get faster read times by using compression, as most data is reasonably compressible, so that's where parallelization can be useful. There are plenty of packages with more tricks like mmapping the files etc., but the above is just base R.
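
To illustrate the compression point while staying in base R, here is a small sketch that writes the same layout through a gzfile() connection instead of file(). The test matrix is made up to be highly repetitive so that it compresses well; runif() output like in the example above would be expected to compress poorly. gzfile() is single-threaded, so this shows only the effect on file size, not any parallel speed-up.

```r
## Illustrative data only: a repetitive matrix that compresses well.
m <- matrix(rep(c(0, 1, 2, 3), length.out = 1e6), ncol = 1000)

raw_path <- tempfile()
gz_path  <- tempfile()

## Uncompressed, as in the proof of concept
con <- file(raw_path, open = "wb")
writeBin(dim(m), con)
writeBin(as.vector(m), con)
close(con)

## Same calls, but through a gzip-compressed connection
con <- gzfile(gz_path, open = "wb")
writeBin(dim(m), con)
writeBin(as.vector(m), con)
close(con)

file.size(raw_path)  # 8 + 8 * 1e6 bytes
file.size(gz_path)   # much smaller for this repetitive data

## Read back from the compressed file
con <- gzfile(gz_path, open = "rb")
d  <- readBin(con, what = integer(), n = 2L)
m2 <- readBin(con, what = double(), n = prod(as.numeric(d)))
close(con)
dim(m2) <- d
identical(m, m2)
```

Whether the smaller file actually reads faster depends on how the disk speed compares with the decompression speed, so it is worth timing both on the target machine.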

Cheers,
Simon



> On 9/05/2024, at 3:20 PM, Sameh Abdulah <sameh.abdulah using kaust.edu.sa> wrote:
> 
> Hi,
> 
> I need to serialize and save a 20K x 20K matrix as a binary file. This process is significantly slower in R compared to Python (4X slower).
> 
> I'm not sure about the best approach to optimize the below code. Is it possible to parallelize the serialization function to enhance performance?
> 
> 
>  n <- 20000^2
>  cat("Generating matrices ... ")
>  INI.TIME <- proc.time()
>  A <- matrix(runif(n), ncol = sqrt(n))
>  END_GEN.TIME <- proc.time()
>  arg_ser <- serialize(object = A, connection = NULL)
> 
>  END_SER.TIME <- proc.time()
>  con <- file(description = "matrix_file", open = "wb")
>  writeBin(object = arg_ser, con = con)
>  close(con)
>  END_WRITE.TIME <- proc.time()
>  con <- file(description = "matrix_file", open = "rb")
>  par_raw <- readBin(con, what = raw(), n = file.info("matrix_file")$size)
>  END_READ.TIME <- proc.time()
>  B <- unserialize(connection = par_raw)
>  close(con)
>  END_DES.TIME <- proc.time()
>  TIME <- END_GEN.TIME - INI.TIME
>  cat("Generation time", TIME[3], " seconds.")
> 
>  TIME <- END_SER.TIME - END_GEN.TIME
>  cat("Serialization time", TIME[3], " seconds.")
> 
>  TIME <- END_WRITE.TIME - END_SER.TIME
>  cat("Writing time", TIME[3], " seconds.")
> 
>  TIME <- END_READ.TIME - END_WRITE.TIME
>  cat("Read time", TIME[3], " seconds.")
> 
>  TIME <- END_DES.TIME - END_READ.TIME
>  cat("Deserialize time", TIME[3], " seconds.")
> 
> 
> 
> 
> Best,
> --Sameh
> 
> 
> ______________________________________________
> R-package-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
> 


