[Rd] Choices to remove `srcref` (and its buddies) when serializing objects

Ivan Krylov
Fri Jan 12 09:42:33 CET 2024


On Fri, 12 Jan 2024 00:11:45 -0500
Dipterix Wang <dipterix.wang using gmail.com> wrote:

> I wonder how hard it would be to have options to discard source when
> serializing R objects? 

> Currently my analyses heavily depend on digest function to generate
> file caches and automatically schedule pipelines (to update cache)
> when changes are detected.

Source references may be the main problem here, but not the only one.
There are also string encodings and function bytecode (which may or may
not be present and probably changes between R versions). In my package
'depcache', I've been collecting the ways in which objects that are
identical() to each other can serialize() differently; I'm sure I
missed a few.
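
To illustrate the point above, here is a minimal sketch using only base R and utils. The names f, g, x and y are made up for the example, and parse(keep.source = TRUE) is used so that the result does not depend on the keep.source option of the session:

```r
# Two definitions that differ only in whitespace in their source text.
f <- eval(parse(text = "function(x) x + 1", keep.source = TRUE), globalenv())
g <- eval(parse(text = "function(x)   x + 1", keep.source = TRUE), globalenv())

# identical() ignores source references by default (ignore.srcref = TRUE)...
same_object <- identical(f, g)
# ...but serialize() writes the 'srcref' attribute along with the function.
same_bytes <- identical(serialize(f, NULL), serialize(g, NULL))

# utils::removeSource() drops the source references from a function.
same_bytes_nosrc <- identical(serialize(removeSource(f), NULL),
                              serialize(removeSource(g), NULL))

# The same kind of mismatch for strings in different marked encodings
# (assumes the platform's iconv() can convert to latin1).
x <- "\uff"
y <- iconv(x, "UTF-8", "latin1")
same_string <- identical(x, y)
same_string_bytes <- identical(serialize(x, NULL), serialize(y, NULL))

print(c(same_object = same_object, same_bytes = same_bytes,
        same_bytes_nosrc = same_bytes_nosrc,
        same_string = same_string, same_string_bytes = same_string_bytes))
```

I would expect same_object and same_string to be TRUE while same_bytes and same_string_bytes are FALSE. Bytecode is not covered here: it only appears after a function has been compiled.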

Admittedly, string encodings are less important nowadays (except on
older Windows and weirdly set up Unix-like systems). Thankfully, the
digest package already knows to skip the serialization header (which
contains the current version of R).
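
As a rough sketch of what that header looks like (assuming the default binary XDR format described in R-ints [*]; the variable names are only for illustration), the version of the writing R can be read straight out of the first bytes:

```r
s <- serialize(1L, NULL)            # default: binary XDR format

marker <- rawToChar(s[1:2])         # "X\n" marks the XDR format
fmt_version <- as.integer(s[6])     # low byte of the big-endian format version
writer <- as.integer(s[8:10])       # major, minor, patch of the R that wrote it

print(marker)
print(fmt_version)
print(writer)

# In format version 2 the header is exactly 14 bytes (the marker plus three
# 4-byte integers), so dropping them leaves only the payload. Version 3
# headers are longer because they also record the native encoding.
s2 <- serialize(1L, NULL, version = 2)
payload <- s2[-(1:14)]
```

This only shows where the R version sits in the stream; it is not a description of how digest itself is implemented.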

serialize() only knows about basic types [*], and source references are
implemented on top of these as objects of class 'srcref'. Sometimes
they are attached as attributes to other objects; other times (e.g. in
quote(function(){}) [**]) they just sit there as arguments to a call.
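
Both places can be inspected directly. This is a small sketch; parse(keep.source = TRUE) stands in for quote() so that it behaves the same in non-interactive sessions, and e and fn are illustrative names:

```r
# The unevaluated call to `function`, as the parser produces it.
e <- parse(text = "function(){}", keep.source = TRUE)[[1]]

n_args <- length(e)                 # `function`, formals, body, srcref
fourth <- class(e[[4]])             # the srcref sits there as a plain argument

# Once the call is evaluated, the srcref becomes an attribute of the closure.
fn <- eval(e)
attached <- class(attr(fn, "srcref"))

print(n_args)
print(fourth)
print(attached)
```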

Sometimes you can hash the output of deparse(x) instead of serialize(x)
[***]. Text representations aren't without their own problems (e.g.
IEEE floating-point numbers not being representable as decimal
fractions), but at least deparsing both ignores the source references
and punts the encoding problem to the abstraction layer above it:
deparse() is the same for both '\uff' and iconv('\uff', 'UTF-8',
'latin1'): just "ÿ".
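
A minimal sketch of the deparse() approach and of the floating-point caveat, with made-up names (f, g, key_f, key_g, a, b). The resulting strings would then be passed to whatever hash function is in use:

```r
f <- eval(parse(text = "function(x) x + 1", keep.source = TRUE), globalenv())
g <- eval(parse(text = "function(x)   x + 1", keep.source = TRUE), globalenv())

# With its default 'control', deparse() works from the parse tree and does
# not look at the source references, so the whitespace difference vanishes.
key_f <- paste(deparse(f), collapse = "\n")
key_g <- paste(deparse(g), collapse = "\n")
same_key <- identical(key_f, key_g)

# The caveat: by default numbers are written with 15 significant digits,
# so two different doubles can deparse to the same text.
a <- 0.1 + 0.2
b <- 0.3
same_value <- (a == b)
same_text <- identical(deparse(a), deparse(b))
same_text17 <- identical(deparse(a, control = "digits17"),
                         deparse(b, control = "digits17"))

print(c(same_key = same_key, same_value = same_value,
        same_text = same_text, same_text17 = same_text17))
```

Whether collapsing such nearly equal numbers is acceptable depends on what the cache is supposed to detect; control = "digits17" keeps them apart.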

Unfortunately, this doesn't solve the environment problem. To handle
environments, you really need a way to canonicalize reference-semantics
objects before serializing them, without changing the originals, even
for cycles like a <- new.env(); b <- new.env(); a$x <- b; b$x <- a. I'm
not sure that reference hooks can help with that. In order to implement it
properly, the fixup process will have to rely on global state and keep
weak references to the environments it visits and creates shadow copies
of.
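
To make the idea more concrete, here is a very rough sketch. It is deliberately simpler than the design described above: canonicalize() is a hypothetical name, it returns plain lists rather than shadow environments, it keeps its list of visited environments only for the duration of one call (no weak references or global state), and it ignores enclosing environments, attributes of non-list objects and environments of closures.

```r
canonicalize <- function(x) {
  seen <- list()  # environments visited so far, in visiting order

  walk <- function(obj) {
    if (is.environment(obj)) {
      # An environment we have already seen becomes a back-reference,
      # which is what keeps cycles from recursing forever.
      for (i in seq_along(seen)) {
        if (identical(seen[[i]], obj))
          return(structure(list(id = i), class = "env_backref"))
      }
      seen[[length(seen) + 1L]] <<- obj
      id <- length(seen)
      # Sort the names in a locale-independent way so that insertion order
      # and hash table layout do not matter.
      nms <- sort(ls(obj, all.names = TRUE), method = "radix")
      # Note: get() forces promises and active bindings.
      vals <- lapply(nms, function(n)
        walk(get(n, envir = obj, inherits = FALSE)))
      names(vals) <- nms
      structure(list(id = id, values = vals), class = "env_copy")
    } else if (typeof(obj) == "list") {
      # Recurse into generic vectors, keeping their attributes as they are.
      attrs <- attributes(obj)
      res <- lapply(unclass(obj), walk)
      attributes(res) <- attrs
      res
    } else {
      obj
    }
  }

  walk(x)
}

# The cyclic example from above, built twice with a different insertion order.
a <- new.env(); b <- new.env()
a$x <- b; b$x <- a; a$n <- 1

a2 <- new.env(); b2 <- new.env()
a2$n <- 1; b2$x <- a2; a2$x <- b2

ca <- canonicalize(a)
ca2 <- canonicalize(a2)
print(identical(serialize(ca, NULL), serialize(ca2, NULL)))
```

The originals are left untouched, and the two cyclic structures end up with the same plain-list representation, which can then be serialized and hashed.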

I think it's not impossible to implement
serialize_to_canonical_representation() for an R package, but it will
be a lot of work to decide which parts are canonical and which should
be discarded.

-- 
Best regards,
Ivan

[*]
https://cran.r-project.org/doc/manuals/R-ints.html#Serialization-Formats

[**]
https://bugs.r-project.org/show_bug.cgi?id=18638

[***]
https://stat.ethz.ch/pipermail/r-devel/2023-March/082505.html
