[R-pkg-devel] Fast Matrix Serialization in R?

Simon Urbanek simon.urbanek at R-project.org
Fri May 10 05:12:17 CEST 2024



> On 10/05/2024, at 12:31 PM, Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote:
> 
> On Thu, May 9, 2024 at 3:46 PM Simon Urbanek
> <simon.urbanek at r-project.org> wrote:
>> 
>> FWIW serialize() is binary so there is no conversion to text:
>> 
>>> serialize(1:10+0L, NULL)
>> [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
>> [26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
>> [51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a
>> 
>> It uses the native representation so it is actually not as bad as it sounds.
>> 
>> One aspect I forgot to mention in the earlier thread is that if you don't need to exchange the serialized objects between machines with different endianness, then avoiding the swap makes it faster. E.g., on Intel (which is little-endian and thus needs swapping):
>> 
>>> a=1:1e8/2
>>> system.time(serialize(a, NULL))
>>   user  system elapsed
>>  2.123   0.468   2.661
>>> system.time(serialize(a, NULL, xdr=FALSE))
>>   user  system elapsed
>>  0.393   0.348   0.742
> 
> Would it be worth looking into making xdr=FALSE the default? From
> help("serialize"):
> 
> xdr: a logical: if a binary representation is used, should a
> big-endian one (XDR) be used?
> ...
> As almost all systems in current use are little-endian, xdr = FALSE
> can be used to avoid byte-shuffling at both ends when transferring
> data from one little-endian machine to another (or between processes
> on the same machine). Depending on the system, this can speed up
> serialization and unserialization by a factor of up to 3x.
> 
> This seems like a low-hanging fruit that could spare the world from
> wasting unnecessary CPU cycles.
> 


I have thought about this before, but the main problem here is (as so often) compatibility. The current default guarantees that the output can be read safely on any machine, while xdr=FALSE only works between machines with the same endianness and will fail horribly otherwise. R cannot know whether the user intends to transport the serialized data to another machine, so it cannot assume the native byte order is safe unless the user says so. Therefore all we can safely do is tell users to use it where appropriate -- and the documentation explicitly says so:

     As almost all systems in current use are little-endian, ‘xdr =
     FALSE’ can be used to avoid byte-shuffling at both ends when
     transferring data from one little-endian machine to another (or
     between processes on the same machine).  Depending on the system,
     this can speed up serialization and unserialization by a factor of
     up to 3x.

Unfortunately, hardly anyone reads the documentation, so this is not as effective as changing the default would be, but for the reasons above it is just not that easy to change. I do acknowledge that the risk is relatively low since big-endian machines are becoming rare, but it is not zero.
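For illustration, a minimal sketch of the explicit opt-in the documentation describes: only serialize() and .Platform$endian are base R here, and peer_endian is a placeholder for whatever out-of-band knowledge you have about the receiving side.

    ## Skip the XDR byte swap only when both ends are known to share the same
    ## byte order; fall back to the portable big-endian default otherwise.
    local_endian <- .Platform$endian   # "little" or "big" on this machine
    peer_endian  <- "little"           # assumed/known byte order of the reader
    same_order   <- identical(local_endian, peer_endian)

    payload <- serialize(1:1e8 / 2, NULL, xdr = !same_order)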

That said, what worries me a bit more is that some derived functions such as saveRDS() don't expose the xdr option, so you actually have no way to use the native binary format there. I understand the logic (see above), but, as you said, it makes them unnecessarily slow. I wonder if it may be worth doing something a bit smarter and officially tagging a "reverse XDR" format instead - that way it would be well-defined and could be made the default. Interestingly, the de-serialization side doesn't care: you can use readRDS() on the native binary serialization even in current R versions, so just adding the option would still be backwards-compatible. Definitely something to think about...
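To make that concrete, here is a rough sketch of a saveRDS()-style writer built on serialize() with xdr=FALSE; the function name is made up and this is not part of base R, but current readRDS() detects the stream format from its header, so on a machine with matching endianness it can read the result back.

    ## Hypothetical helper, not part of base R: write the native-byte-order
    ## serialization to a gzip-compressed file, as saveRDS() does by default.
    saveRDS_native <- function(object, file) {
        con <- gzfile(file, "wb")
        on.exit(close(con))
        serialize(object, con, xdr = FALSE)  # skip the big-endian (XDR) swap
        invisible(NULL)
    }

    ## Round trip on a same-endianness machine, e.g.:
    ## saveRDS_native(matrix(runif(1e6), 1000), "m.rds")
    ## m <- readRDS("m.rds")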

Cheers,
Simon


> 
> 
>> 
>> Cheers,
>> Simon
>> 
>> ______________________________________________
>> R-package-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-package-devel
> 


