[Rd] Choices to remove `srcref` (and its buddies) when serializing objects

Lionel Henry ||one| @end|ng |rom po@|t@co
Wed Jan 17 11:32:59 CET 2024


> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation

We have one in vctrs but it's not exported:
https://github.com/r-lib/vctrs/blob/main/src/hash.c

The main use is vectorised hashing:

```
# Non-vectorised
vctrs:::obj_hash(1:10)
#> [1] 1e 77 ce 48

# Vectorised
vctrs:::vec_hash(1L)
#> [1] 70 a2 85 ef
vctrs:::vec_hash(1:2)
#> [1] 70 a2 85 ef bf 3c 2c cf

# vctrs semantics so dfs are vectors of rows
length(vctrs:::vec_hash(mtcars)) / 4
#> [1] 32
nrow(mtcars)
#> [1] 32
```

Best,
Lionel

On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera
<tomas.kalibera using gmail.com> wrote:
>
> On 1/16/24 20:16, Dipterix Wang wrote:
> > Could you recommend any packages/functions that compute hash such that
> > the source references and sexpinfo_struct are ignored? Basically a
> > version of `serialize` that convert R objects to raw without storing
> > the ancillary source reference and sexpinfo.
> > I think most people would think of `digest` but that package uses
> > `serialize` (see discussion
> > https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)
>
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation. Again, if that wasn't clear: I don't think
> trying to compute a hash of an object from its serialized representation
> is a good idea - it is of course convenient, but has problems like the
> one you have ran into.
>
> In some applications it may still be good enough: if by various tweaks,
> such as ensuring source references are off in your case, you achieve a
> state when false alarms are rare (identical objects have different
> hashes), and hence say unnecessary re-computation is rare, maybe it is
> good enough.
>
> Tomas
>
> >
> >> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera
> >> <tomas.kalibera using gmail.com> wrote:
> >>
> >>
> >> On 1/12/24 06:11, Dipterix Wang wrote:
> >>> Dear R devs,
> >>>
> >>> I was digging into a package issue today when I realized R serialize
> >>> function not always generate the same results on equivalent objects
> >>> when users choose to run differently. For example, the following code
> >>>
> >>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
> >>>
> >>> generates different results when I copy-paste into console vs when I
> >>> use ctrl+shift+enter to source the file in RStudio.
> >>>
> >>> With a deeper inspect into the cause, I found that function and
> >>> language get source reference when getOption("keep.source") is TRUE.
> >>> This means the source reference will make the functions different
> >>> while in most cases, whether keeping function source might not
> >>> impact how a function behaves.
> >>>
> >>> While it's OK that function serialize generates different results,
> >>> functions such as `rlang::hash` and `digest::digest`, which depend
> >>> on `serialize` might eventually deliver false positives on same
> >>> inputs. I've checked source code in digest package hoping to get
> >>> around this issue (for example serialize(..., refhook = ...)).
> >>> However, my workaround did not work. It seems that the markers to
> >>> the objects are different even if I used `refhook` to force srcref
> >>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`.
> >>> None of them works directly on nested environments with multiple
> >>> functions.
> >>>
> >>> I wonder how hard it would be to have options to discard source when
> >>> serializing R objects?
> >>>
> >>> Currently my analyses heavily depend on digest function to generate
> >>> file caches and automatically schedule pipelines (to update cache)
> >>> when changes are detected. The pipelines save the hashes of source
> >>> code, inputs, and outputs together so other people can easily verify
> >>> the calculation without accessing the original data (which could be
> >>> sensitive), or running hour-long analyses, or having to buy servers.
> >>> All of these require `serialize` to produce the same results
> >>> regardless of how users choose to run the code.
> >>>
> >>> It would be great if this feature could be in the future R. Other
> >>> pipeline packages such as `targets` and `drake` can also benefit
> >>> from it.
> >>
> >> I don't think such functionality would belong to serialize(). This
> >> function is not meant to produce stable results based on the input,
> >> the serialized representation may even differ based on properties not
> >> seen by users.
> >>
> >> I think an option to ignore source code would belong to a function
> >> that computes the hash, as other options of identical().
> >>
> >> Tomas
> >>
> >>
> >>> Thanks,
> >>>
> >>> - Dipterix
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-devel using r-project.orgmailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list