[Rd] Choices to remove `srcref` (and its buddies) when serializing objects

Charlie Gao ch@r||e@g@o @end|ng |rom @h|kokuchuo@net
Thu Jan 18 16:39:36 CET 2024


> ------------------------------
>
> Date: Wed, 17 Jan 2024 11:35:02 -0500
> From: Dipterix Wang <dipterix.wang using gmail.com>
> To: Lionel Henry <lionel using posit.co>, Tomas Kalibera
>  <tomas.kalibera using gmail.com>
> Cc: r-devel using r-project.org
> Subject: Re: [Rd] Choices to remove `srcref` (and its buddies) when
>  serializing objects
> Message-ID: <3CF4CA2D-9F72-4C7B-90AA-4D2E9F745430 using gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> > On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera
> > <tomas.kalibera using gmail.com> wrote:
> > >
> > > I think one could implement hashing on the fly without any
> > > serialization, similarly to how identical works, but I am not aware of
> > > any existing implementation. Again, if that wasn't clear: I don't think
> > > trying to compute a hash of an object from its serialized representation
> > > is a good idea - it is of course convenient, but has problems like the
> > > one you have ran into.
> > >
> > > In some applications it may still be good enough: if by various tweaks,
> > > such as ensuring source references are off in your case, you achieve a
> > > state when false alarms are rare (identical objects have different
> > > hashes), and hence say unnecessary re-computation is rare, maybe it is
> > > good enough.
> > >
> >
>
> I really appreciate you answer my questions and solve my puzzles. I went back and read the R internal code for `serialize` and totally agree on this, that serialization is not a good idea for digesting R objects, especially on environments, expressions, and functions. 
> 
> What I want is a function that can produce the same and stable hash for identical objects. However, there is no function (given our best knowledge) on the market that can do this. `digest::digest` and `rlang::hash` are the first functions that come into my mind. Both are widely used, but they use serialize. The author of `digest` said:
> 
>  > "As you know, digest takes and (ahem) "digests" what serialize gives it, so you would have to look into what serialize lets you do."
> 
> vctrs:::obj_hash is probably the closest to the implementation of `identical`, but the above examples give different results for identical objects.
> 
> The existence of digest::digest and rlang::hash shows that there is a huge demand for this "ideal" hash function. However, I bet most people are using digest/hash "incorrectly".

Please read the full discussion in this old bug report: https://bugs.r-project.org/show_bug.cgi?id=18178

Quoting briefly: Serialization is not intended to be used this way. What serialization tries to provide is that x and unserialize(serialize(x, NULL)) will be identical() while preserving internal representation where possible. Two objects that are considered identical() can have very different internal representations, and their serializations will reflect this.
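
A minimal sketch of that last point, in base R only. It assumes R 3.5.0 or later, where `1:3` is typically stored as a compact (ALTREP) sequence while `c(1L, 2L, 3L)` is an ordinary integer vector; the variable names are purely illustrative.

```r
# Two integer vectors that identical() treats as the same value.
a <- 1:3             # typically a compact ALTREP sequence (assumption: R >= 3.5.0)
b <- c(1L, 2L, 3L)   # an ordinary integer vector

same_value <- identical(a, b)

# Compare the serialized bytes of the two objects. Because the internal
# representations can differ, this may be FALSE even when same_value is TRUE.
same_bytes <- identical(serialize(a, NULL), serialize(b, NULL))

# The round trip that serialization is actually meant to guarantee.
round_trip <- identical(a, unserialize(serialize(a, NULL)))

print(c(same_value = same_value, same_bytes = same_bytes, round_trip = round_trip))
```

Any hash computed from `serialize()` output inherits whatever `same_bytes` turns out to be on your R version, which is the crux of the problem.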

You will see that it is not as simple as just removing the srcref or the bytecode from functions. The issue with the `identical()` function in that context was eventually patched, but the comment by R-Core that serialization is not intended to be used to produce a reliable hash stands. Neither `identical()` nor `serialize()` is designed to guarantee that equal objects yield the same bytes to hash.
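
For readers wondering why source references come up at all, here is a small base R sketch, assuming the default `ignore.srcref = TRUE` of `identical()`. It only shows that a `srcref` attribute changes the serialized bytes; as the bug report discusses, stripping it is not sufficient in general. The names `code`, `f_src` and `f_nosrc` are made up for the example.

```r
code <- "function(x) x + 1"

# The same function text, parsed with and without source references.
f_src   <- eval(parse(text = code, keep.source = TRUE))
f_nosrc <- eval(parse(text = code, keep.source = FALSE))

# Only the first closure carries a "srcref" attribute.
has_srcref <- c(f_src   = !is.null(attr(f_src, "srcref")),
                f_nosrc = !is.null(attr(f_nosrc, "srcref")))

# identical() ignores srcref on closures by default ...
same_value <- identical(f_src, f_nosrc)

# ... but serialize() writes the attribute out, so the bytes differ.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_nosrc, NULL))

print(has_srcref)
print(c(same_value = same_value, same_bytes = same_bytes))
```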

This is echoed by Tomas' comment above. But we note that it is 'good enough' in most cases.

FWIW, `nanonext::sha256()` and family hash character strings and raw vectors directly, but use the same serialization-based approach as `digest::digest()` for everything else. So if someone comes up with a canonical binary representation of R objects, these functions would be able to hash it reliably.


