[Rd] [External] Re: Choices to remove `srcref` (and its buddies) when serializing objects

luke-tierney using uiowa.edu
Thu Jan 18 16:59:31 CET 2024


On Thu, 18 Jan 2024, Ivan Krylov via R-devel wrote:

> On Tue, 16 Jan 2024 14:16:19 -0500
> Dipterix Wang <dipterix.wang using gmail.com> wrote:
>
>> Could you recommend any packages/functions that compute a hash such
>> that the source references and sexpinfo_struct are ignored? Basically
>> a version of `serialize` that converts R objects to raw without
>> storing the ancillary source reference and sexpinfo.
>
> I can show how this can be done, but it's not currently on CRAN or even
> a well-defined package API. I have adapted a copy of R's serialize()
> [*] with the following changes:
>
> * Function bytecode and flags are ignored:
>
> f <- function() invisible()
> depcache:::hash(f, 2) # This is plain FNV1a-64 of serialize() output
> # [1] "9b7a1af5468deba4"
> .Call(depcache:::C_hash2, f) # This is the new hash
> # [1] 91 5f b8 a1 b0 6b cb 40
> f() # called once: function gets the MAYBEJIT_MASK flag
> depcache:::hash(f, 2)
> # [1] "7d30e05546e7a230"
> .Call(depcache:::C_hash2, f)
> # [1] 91 5f b8 a1 b0 6b cb 40
> f() # called twice: function now has bytecode
> depcache:::hash(f, 2)
> # [1] "2a2cba4150e722b8"
> .Call(depcache:::C_hash2, f)
> # [1] 91 5f b8 a1 b0 6b cb 40 # new hash stays the same
>
> * Source references are ignored:
>
> .Call(depcache:::C_hash2, \( ) invisible( ))
> # [1] 91 5f b8 a1 b0 6b cb 40 # compare vs. above
>
> # For quoted function definitions, source references have to be handled
> # differently
> .Call(depcache:::C_hash2, quote(function(){}))
> # [1] 58 0d 44 8e d4 fd 37 6f
> .Call(depcache:::C_hash2, quote(\( ){      }))
> # [1] 58 0d 44 8e d4 fd 37 6f
>
> * ALTREP is ignored:
>
> identical(1:10, 1:10+0L)
> # [1] TRUE
> identical(serialize(1:10, NULL), serialize(1:10+0L, NULL))
> # [1] FALSE
> identical(
> .Call(depcache:::C_hash2, 1:10),
> .Call(depcache:::C_hash2, 1:10+0L)
> )
> # [1] TRUE
>
> * Strings not marked as bytes are encoded into UTF-8:
>
> identical('\uff', iconv('\uff', 'UTF-8', 'latin1'))
> # [1] TRUE
> identical(
> serialize('\uff', NULL),
> serialize(iconv('\uff', 'UTF-8', 'latin1'), NULL)
> )
> # [1] FALSE
> identical(
> .Call(depcache:::C_hash2, '\uff'),
> .Call(depcache:::C_hash2, iconv('\uff', 'UTF-8', 'latin1'))
> )
> # [1] TRUE
>
> * NaNs with different payloads (except NA_numeric_) are replaced by
>   R_NaN.
>
> One of the many downsides to the current approach is that we rely on
> the non-API entry point getPRIMNAME() in order to hash builtins.
> Looking at the source code for identical() is no help here, because it
> uses the private PRIMOFFSET macro.
>
> The bitstream being hashed is also, unfortunately, not exactly
> compatible with R serialization format version 2: I had to ignore the
> LEVELS of the language objects being hashed both because identical()
> seems to ignore those and because I was missing multiple private
> definitions (e.g. the MAYBEJIT flag) to handle them properly.
>
> Then there's also the problem of immediate bindings [**]: I've seen bits
> of vctrs, rstudio, rlang blow up when calling CAR() on SEXP objects that
> are not safe to handle this way, but R_expand_binding_value() (used by
> serialize()) is again a private function that is not accessible from
> packages. identical() won't help here, because it compares reference
> objects (which may or may not contain such immediate bindings) by their
> pointer values instead of digging down into them.

What does 'blow up' mean? If it is anything other than signaling a "bad
binding access" error then it would be good to have more details.

Best,

luke

> Dropping the (already violated) requirement to be compatible with R
> serialization bitstream will make it possible to simplify the code
> further.
>
> Finally:
>
> a <- new.env()
> b <- new.env()
> a$x <- b$x <- 42
> identical(a, b)
> # [1] FALSE
> .Call(depcache:::C_hash2, a)
> # [1] 44 21 f1 36 5d 92 03 1b
> .Call(depcache:::C_hash2, b)
> # [1] 44 21 f1 36 5d 92 03 1b
>
> ...but that's unavoidable when looking at frozen object contents
> instead of their live memory layout.
>
> If you're interested, here's the development version of the package:
> install.packages('depcache',contriburl='https://aitap.github.io/Rpackages')
>
>
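
For the narrower goal in the original question (ignoring source references
only), a minimal base-R sketch is possible without the modified serializer
described above. It assumes that utils::removeSource() strips everything
srcref-related that matters for the objects at hand; it does not normalize
bytecode, JIT flags, ALTREP, string encodings or NaN payloads, so it is not
a substitute for the C_hash2 approach. The helper name serialize_nosrc is
hypothetical and not part of any package.

```r
# Two definitions of the same function that differ only in whitespace.
# keep.source = TRUE forces srcrefs even in non-interactive sessions;
# envir = globalenv() keeps the closure environment out of the comparison.
f <- eval(parse(text = "function(x) x + 1", keep.source = TRUE),
          envir = globalenv())
g <- eval(parse(text = "function(x)    x  +  1", keep.source = TRUE),
          envir = globalenv())

# Hypothetical helper: drop srcref, srcfile and wholeSrcref attributes,
# then serialize to a raw vector as usual.
serialize_nosrc <- function(fn) serialize(utils::removeSource(fn), NULL)

# Comparison of the plain serialize() output, srcrefs included.
same_with_srcref <- identical(serialize(f, NULL), serialize(g, NULL))

# Comparison after the source references have been removed.
same_without_srcref <- identical(serialize_nosrc(f), serialize_nosrc(g))

print(c(with_srcref = same_with_srcref,
        without_srcref = same_without_srcref))
```

Neither f nor g is called here, so neither has picked up JIT flags or
bytecode yet; once a function has been called, the raw vectors can differ
again for the reasons shown in the quoted examples.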

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
