[Rd] Choices to remove `srcref` (and its buddies) when serializing objects

Tomas Kalibera tomas.kalibera using gmail.com
Wed Jan 17 10:31:40 CET 2024


On 1/16/24 20:16, Dipterix Wang wrote:
> Could you recommend any packages/functions that compute a hash such that 
> the source references and sexpinfo_struct are ignored? Basically a 
> version of `serialize` that converts R objects to raw without storing 
> the ancillary source reference and sexpinfo.
> I think most people would think of `digest` but that package uses 
> `serialize` (see discussion 
> https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)

I think one could implement hashing on the fly without any 
serialization, similarly to how identical works, but I am not aware of 
any existing implementation. Again, if that wasn't clear: I don't think 
trying to compute a hash of an object from its serialized representation 
is a good idea - it is of course convenient, but has problems like the 
one you have run into.
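
To illustrate the difference between comparing objects and comparing 
their serialized bytes, here is a minimal sketch. The names `src`, 
`f_src` and `f_nosrc` are made up for the example; it only relies on 
the documented `keep.source` argument of parse() and the 
`ignore.srcref` argument of identical(). It is not a hashing 
implementation, just a demonstration of the behaviour such an 
implementation could mirror.

```r
# Build the same function twice: once with source references, once without.
src <- "function(x) x + 1"
f_src   <- eval(parse(text = src, keep.source = TRUE))
f_nosrc <- eval(parse(text = src, keep.source = FALSE))

# Only the first copy carries a "srcref" attribute.
has_srcref <- c(!is.null(attr(f_src, "srcref")),
                !is.null(attr(f_nosrc, "srcref")))

# identical() ignores srcref on closures by default (ignore.srcref = TRUE),
# and can be told to take it into account.
same_default <- identical(f_src, f_nosrc)
same_strict  <- identical(f_src, f_nosrc, ignore.srcref = FALSE)

# The serialized bytes include the srcref, so they differ, and so would
# any hash computed from them.
same_bytes <- identical(serialize(f_src, NULL), serialize(f_nosrc, NULL))
```

A hash computed by walking the object directly could offer the same 
kind of switch as identical(), instead of inheriting whatever 
serialize() happens to write.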

In some applications it may still be good enough: if by various tweaks, 
such as ensuring source references are off in your case, you reach a 
state where false alarms (identical objects having different hashes) 
are rare, and hence unnecessary re-computation is rare, that may be 
all you need.
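
As a rough sketch of one such tweak, source references can be switched 
off before the functions are created. The documented default of the 
`keep.source` option is interactive(), so interactive and batch runs 
can differ unless it is set explicitly. The function below is only a 
placeholder for illustration, and this does not make serialize() 
output stable in general; it only removes one known source of 
differences.

```r
# Switch source references off before the functions are created.
old <- options(keep.source = FALSE)

# parse() defaults to keep.source = getOption("keep.source"), so the
# resulting closure is created without a "srcref" attribute.
f <- eval(parse(text = "function(x) x + 1"))
no_srcref <- is.null(attr(f, "srcref"))

# Restore the previous setting.
options(old)
```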

Tomas

>
>> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera 
>> <tomas.kalibera using gmail.com> wrote:
>>
>>
>> On 1/12/24 06:11, Dipterix Wang wrote:
>>> Dear R devs,
>>>
>>> I was digging into a package issue today when I realized that R's 
>>> serialize function does not always generate the same results for 
>>> equivalent objects when users run the code in different ways. For 
>>> example, the following code
>>>
>>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
>>>
>>> generates different results when I copy-paste into console vs when I 
>>> use ctrl+shift+enter to source the file in RStudio.
>>>
>>> Looking deeper into the cause, I found that functions and language 
>>> objects get source references when getOption("keep.source") is TRUE. 
>>> This means the source references make the functions differ, while 
>>> in most cases keeping the function source does not affect how a 
>>> function behaves.
>>>
>>> While it's OK that serialize generates different results, 
>>> functions such as `rlang::hash` and `digest::digest`, which depend 
>>> on `serialize`, might eventually deliver false positives on the 
>>> same inputs. I've checked the digest source code hoping to get 
>>> around this issue (for example serialize(..., refhook = ...)). 
>>> However, my workaround did not work. It seems that the markers to 
>>> the objects are different even if I used `refhook` to force srcref 
>>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`. 
>>> None of them works directly on nested environments with multiple 
>>> functions.
>>>
>>> I wonder how hard it would be to have options to discard source when 
>>> serializing R objects?
>>>
>>> Currently my analyses heavily depend on digest function to generate 
>>> file caches and automatically schedule pipelines (to update cache) 
>>> when changes are detected. The pipelines save the hashes of source 
>>> code, inputs, and outputs together so other people can easily verify 
>>> the calculation without accessing the original data (which could be 
>>> sensitive), or running hour-long analyses, or having to buy servers. 
>>> All of these require `serialize` to produce the same results 
>>> regardless of how users choose to run the code.
>>>
>>> It would be great if this feature could be in a future version of 
>>> R. Other pipeline packages such as `targets` and `drake` could also 
>>> benefit from it.
>>
>> I don't think such functionality would belong to serialize(). This 
>> function is not meant to produce stable results based on the input, 
>> the serialized representation may even differ based on properties not 
>> seen by users.
>>
>> I think an option to ignore source code would belong to a function 
>> that computes the hash, similar to the options of identical().
>>
>> Tomas
>>
>>
>>> Thanks,
>>>
>>> - Dipterix
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>


