[Rd] Choices to remove `srcref` (and its buddies) when serializing objects
Tomas Kalibera
tomas.kalibera using gmail.com
Wed Jan 17 10:31:40 CET 2024
On 1/16/24 20:16, Dipterix Wang wrote:
> Could you recommend any packages/functions that compute a hash such
> that the source references and sexpinfo_struct are ignored? Basically
> a version of `serialize` that converts R objects to raw without
> storing the ancillary source references and sexpinfo.
> I think most people would think of `digest` but that package uses
> `serialize` (see discussion
> https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)
I think one could implement hashing on the fly without any
serialization, similarly to how identical works, but I am not aware of
any existing implementation. Again, if that wasn't clear: I don't think
trying to compute a hash of an object from its serialized representation
is a good idea - it is of course convenient, but has problems like the
one you have run into.
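
To make the difference concrete, here is a minimal sketch (the variable
names are only illustrative, and it assumes the default
ignore.srcref = TRUE of identical()). It builds the same function once
with and once without source references and compares the two both ways:

```r
# Build the "same" function twice: once with source references, once without.
# globalenv() is used so both closures share the same enclosing environment.
src <- "function(x) x + 1"
f_src   <- eval(parse(text = src, keep.source = TRUE),  envir = globalenv())
f_nosrc <- eval(parse(text = src, keep.source = FALSE), envir = globalenv())

# Only the first closure carries a "srcref" attribute.
has_srcref <- c(src   = !is.null(attr(f_src, "srcref")),
                nosrc = !is.null(attr(f_nosrc, "srcref")))

# identical() ignores source references by default (ignore.srcref = TRUE).
same_object <- identical(f_src, f_nosrc)

# The serialized bytes include the srcref, so a hash of them would differ.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_nosrc, NULL))
```

A hash computed by walking the object the way identical() does could
take the same kind of option; a hash of the serialized bytes cannot.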
In some applications it may still be good enough: if by various tweaks,
such as ensuring source references are off in your case, you reach a
state where false alarms (identical objects having different hashes)
are rare, and hence unnecessary re-computation is rare, then maybe it
is good enough.
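
One such tweak, sketched under the assumption that the object being
hashed is a single function (serialize_no_src is a hypothetical helper
name, not an existing API): strip the source references with
utils::removeSource() before serializing. This does not reach functions
stored inside nested environments, and it does not make the serialized
representation stable in general.

```r
# Hypothetical helper: drop source references, then serialize to a raw vector.
serialize_no_src <- function(f) {
  stopifnot(is.function(f))
  serialize(utils::removeSource(f), NULL)
}

# A function created with source references kept.
f <- eval(parse(text = "function(x) { x + 1 }", keep.source = TRUE),
          envir = globalenv())

# removeSource() clears the srcref on the closure and inside its body.
f_clean  <- utils::removeSource(f)
stripped <- is.null(attr(f_clean, "srcref")) &&
  is.null(attr(body(f_clean), "srcref"))

# The bytes that would be fed to a hash function.
bytes <- serialize_no_src(f)
```

Running scripts with options(keep.source = FALSE) avoids creating the
source references in the first place.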
Tomas
>
>> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera
>> <tomas.kalibera using gmail.com> wrote:
>>
>>
>> On 1/12/24 06:11, Dipterix Wang wrote:
>>> Dear R devs,
>>>
>>> I was digging into a package issue today when I realized that R's
>>> serialize function does not always generate the same results on
>>> equivalent objects when users choose to run the code differently.
>>> For example, the following code
>>>
>>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>>>
>>> generates different results when I copy-paste into console vs when I
>>> use ctrl+shift+enter to source the file in RStudio.
>>>
>>> Looking deeper into the cause, I found that functions and language
>>> objects get source references when getOption("keep.source") is TRUE.
>>> This means the source references make the functions differ, even
>>> though in most cases keeping the function source does not affect
>>> how a function behaves.
>>>
>>> While it's OK that the serialize function generates different
>>> results, functions such as `rlang::hash` and `digest::digest`, which
>>> depend on `serialize`, might eventually deliver false positives on
>>> the same inputs. I've checked the source code of the digest package
>>> hoping to get around this issue (for example
>>> serialize(..., refhook = ...)).
>>> However, my workaround did not work. It seems that the markers to
>>> the objects are different even if I used `refhook` to force srcref
>>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`.
>>> None of them works directly on nested environments with multiple
>>> functions.
>>>
>>> I wonder how hard it would be to have options to discard source when
>>> serializing R objects?
>>>
>>> Currently my analyses heavily depend on the digest function to generate
>>> file caches and automatically schedule pipelines (to update cache)
>>> when changes are detected. The pipelines save the hashes of source
>>> code, inputs, and outputs together so other people can easily verify
>>> the calculation without accessing the original data (which could be
>>> sensitive), or running hour-long analyses, or having to buy servers.
>>> All of these require `serialize` to produce the same results
>>> regardless of how users choose to run the code.
>>>
>>> It would be great if this feature could be in a future version of R. Other
>>> pipeline packages such as `targets` and `drake` can also benefit
>>> from it.
>>
>> I don't think such functionality would belong in serialize(). This
>> function is not meant to produce stable results based on the input;
>> the serialized representation may even differ based on properties not
>> seen by users.
>>
>> I think an option to ignore source code would belong in a function
>> that computes the hash, like the existing options of identical().
>>
>> Tomas
>>
>>
>>> Thanks,
>>>
>>> - Dipterix