[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
Dipterix Wang
d|pter|x@w@ng @end|ng |rom gm@||@com
Wed Jan 17 17:35:02 CET 2024
>
> We have one in vctrs but it's not exported:
> https://github.com/r-lib/vctrs/blob/main/src/hash.c
>
> The main use is vectorised hashing:
>
Thanks for showing me this function. I have read the source code. That's a great idea.
However, I think I might have missed something. When I tried `vctrs:::obj_hash`, I couldn't get identical outputs for identical functions.
``` r
options(keep.source = TRUE)
a <- function(){}
vctrs:::obj_hash(a)
#> [1] 68 e8 5a 0c
a <- function(){}
vctrs:::obj_hash(a)
#> [1] b2 6a 55 9c
a <- function(){}
vctrs:::obj_hash(a)
#> [1] 01 a9 bc 30
options(keep.source = FALSE)
a <- function(){}
vctrs:::obj_hash(a)
#> [1] 93 d7 f2 72
a <- function(){}
vctrs:::obj_hash(a)
#> [1] f3 1d d2 f4
```
Created on 2024-01-17 with [reprex v2.1.0](https://reprex.tidyverse.org)
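For comparison, here is a minimal sketch of how base R's `identical()` treats two such closures. It ignores source references by default (`ignore.srcref = TRUE`). The helper `make_fn` is hypothetical; it only exists to force srcrefs on regardless of `options(keep.source)` and to pin both closures to the global environment.

```r
# Hypothetical helper: build a fresh closure that carries a srcref.
make_fn <- function() {
  fn <- eval(parse(text = "function() {}", keep.source = TRUE))
  environment(fn) <- globalenv()  # same enclosure for every copy
  fn
}

a <- make_fn()
b <- make_fn()

# Each closure carries its own srcref, pointing at its own srcfile.
has_srcref <- !is.null(attr(a, "srcref")) && !is.null(attr(b, "srcref"))

# Default comparison: srcref (and bytecode) differences are ignored.
same_default <- identical(a, b)

# Strict comparison: the srcref attributes take part as well.
same_strict <- identical(a, b, ignore.srcref = FALSE)

print(c(has_srcref = has_srcref,
        same_default = same_default,
        same_strict = same_strict))
```

I would expect `same_default` to be `TRUE` here, which is the behaviour I was hoping a hash function would mirror.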
>
> Best,
> Lionel
>
> On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera
> <tomas.kalibera using gmail.com> wrote:
>>
>> I think one could implement hashing on the fly without any
>> serialization, similarly to how identical works, but I am not aware of
>> any existing implementation. Again, if that wasn't clear: I don't think
>> trying to compute a hash of an object from its serialized representation
>> is a good idea - it is of course convenient, but has problems like the
>> one you have run into.
>>
>> In some applications it may still be good enough: if by various tweaks,
>> such as ensuring source references are off in your case, you achieve a
>> state when false alarms are rare (identical objects have different
>> hashes), and hence say unnecessary re-computation is rare, maybe it is
>> good enough.
I really appreciate you answering my questions and clearing up my confusion. I went back and read the R internal code for `serialize`, and I fully agree that hashing the serialized representation is not a good way to digest R objects, especially environments, expressions, and functions.
What I want is a function that produces the same, stable hash for identical objects. However, to the best of my knowledge, no existing function does this. `digest::digest` and `rlang::hash` are the first functions that come to mind. Both are widely used, but both rely on `serialize`. The author of `digest` said:
> "As you know, digest takes and (ahem) "digests" what serialize gives it, so you would have to look into what serialize lets you do."
`vctrs:::obj_hash` is probably the closest to the implementation of `identical`, but the examples above give different results for identical objects.
The existence of `digest::digest` and `rlang::hash` shows that there is huge demand for this "ideal" hash function. However, I bet most people are using digest/hash "incorrectly".
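One workaround in the spirit of what Tomas describes is to drop the source references before serializing. The sketch below is only that, a sketch: `serialize_without_srcref` and `make_fn` are hypothetical helpers of mine, not part of vctrs, digest, or rlang. It relies on `utils::removeSource()` and deliberately serializes only the formals and the body, so the enclosing environment is ignored (an assumption that makes it weaker than `identical()`).

```r
# Hypothetical helper: serialize a closure after stripping source references.
serialize_without_srcref <- function(fn) {
  stopifnot(is.function(fn), !is.primitive(fn))
  fn <- utils::removeSource(fn)
  # Assumption: only formals and body matter; the enclosing environment
  # is left out on purpose, so closures over different data look the same.
  serialize(list(formals(fn), body(fn)), connection = NULL)
}

# Hypothetical helper: build a closure that is guaranteed to carry srcrefs,
# whatever options(keep.source) happens to be.
make_fn <- function() {
  eval(parse(text = "function(x) {\n  x + 1\n}", keep.source = TRUE))
}

f1 <- make_fn()
f2 <- make_fn()

bytes1 <- serialize_without_srcref(f1)
bytes2 <- serialize_without_srcref(f2)

# The stripped serializations can be compared directly or fed to a hash.
same_bytes <- identical(bytes1, bytes2)
print(same_bytes)
```

The raw vectors could then be handed to a hash function of choice, for example `digest::digest(bytes1, serialize = FALSE)`. This does not address environments or other reference objects, so it is at best a partial fix.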
>>
>> Tomas
>>
More information about the R-devel mailing list