[Rd] Small inconsistency in serialize() between R versions and implications on digest()

Thu Mar 8 13:14:12 CET 2007

On Wed, 7 Mar 2007, Henrik Bengtsson wrote:

> To follow up, I went ahead and generated "random" objects to scan for a
> common header for a given R version, and it seems that at most
> the first 18 bytes are non-data specific, which could be the length of
> the serialization header.
>
> Here is my code for this:
>
> scanSerialize <- function(object, hdr=NULL, ...) {
>  # Serialize object
>  raw <- serialize(object, connection=NULL, ascii=TRUE);
>
>  # First run?
>  if (is.null(hdr))
>    return(raw);
>
>  # Find differences between current longest header and new raw vector
>  n <- length(hdr);
>  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>
>  # No differences?
>  if (!any(diffs))
>    return(hdr);
>
>  # Position of first difference
>  idx <- which(diffs)[1];
>
>  # Keep common header
>  hdr <- hdr[seq_len(idx-1)];
>
>  hdr;
> };
>
> # Serialize a first "random" object
> hdr <- scanSerialize(NA);
> for (kk in 1:100)
>  hdr <- scanSerialize(kk, hdr=hdr);
> for (kk in 1:100) {
>  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
>  hdr <- scanSerialize(x, hdr=hdr);
> }
> for (kk in 1:100) {
>  hdr <- scanSerialize(kk, hdr=hdr);
>  hdr <- scanSerialize(hdr, hdr=hdr);
> }
>
> cat("Length:", length(hdr), "\n");
> print(hdr);
> print(rawToChar(hdr));
>
> On R v2.5.0 devel, this gives:
> Length: 18
> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
> [1] "A\n2\n132352\n131840\n"
>
> However, it would still be good to get an "official" statement from
> one in the R-code team about the serialization header and where the
> data section starts.  Again, I want to cut out as much as possible for
> consistency between R versions without losing data dependent bytes.

An official, and definitive, statement from the _R-core_ team has been
available to you all along at

 	https://svn.r-project.org/R/trunk/src/main/serialize.c

My unofficial and non-definitive interpretation of that statement is
that there is a header of four items,

     A format code 'A' or 'X' ('B' also possible in older formats)
     version number of the format
     Packed integer containing the R version that did the serializing
     Packed integer containing the oldest R version that can read the format

You can see this if you look at the ascii version as text:

     > serialize(1, stdout(), ascii=TRUE)
     A
     2
     132097
     131840
     14
     1
     1
     NULL
     > serialize(as.integer(1), stdout(), ascii=TRUE)
     A
     2
     132097
     131840
     13
     1
     1
     NULL
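
As a side note on the two packed integers in these listings: they appear
consistent with the R_Version() encoding, major * 65536 + minor * 256 +
patch (an assumption here, not verified against serialize.c in this
message). A minimal sketch for unpacking them, where decodeVersion is a
hypothetical helper name:

```r
# Hypothetical helper: unpack an integer assumed to be encoded as
# major * 65536 + minor * 256 + patch.
decodeVersion <- function(packed) {
  major <- packed %/% 65536
  minor <- (packed %% 65536) %/% 256
  patch <- packed %% 256
  paste(major, minor, patch, sep = ".")
}

decodeVersion(132097)  # "2.4.1"
decodeVersion(131840)  # "2.3.0"
```

Under that assumption, the third header line is the only one that should
vary with the R version doing the writing.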

In the non-ascii 'X' (as in xdr) format this will constitute 14 bytes
(2 for the format code and newline, plus three 4-byte integers).
In ascii format I believe it is currently 18 bytes but this could
change with the version number of R -- I'd have to read the official
and definitive statement to see how the integer packing is done and
work out whether that could change the number of bytes. The number of
bytes would also change if we reached format version 10, but something
about the format would also change of course.  A safer way to look at
the header in the ascii version is as the first four lines.
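
The "first four lines" view can be sketched as follows. This is only an
illustration, not anything taken from serialize.c: splitAsciiSerialization
is a hypothetical helper name, and the version=2 argument assumes a current
R in which serialize() accepts it and would otherwise default to a newer
format with additional header fields.

```r
# Hypothetical helper: split an ascii serialization into header and data,
# treating the first 'nfields' newline-terminated fields as the header.
splitAsciiSerialization <- function(raw, nfields = 4L) {
  newlines <- which(raw == as.raw(0x0a))
  stopifnot(length(newlines) >= nfields)
  end <- newlines[nfields]
  list(header = raw[seq_len(end)],
       data   = raw[-seq_len(end)])
}

# Format version 2 is requested explicitly to match the listings above.
bytes <- serialize(1, connection = NULL, ascii = TRUE, version = 2)
parts <- splitAsciiSerialization(bytes)
cat(rawToChar(parts$header))  # the four header lines
cat(rawToChar(parts$data))    # the data dependent part
```

Splitting on newlines rather than on a fixed byte count means the data part
stays intact even if one of the header fields gains or loses a digit.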

Best,

luke

>
> Thanks
>
> /Henrik
>
> On 3/7/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
>> Hi,
>>
>> I noticed that serialize() gives different results depending on R
>> version, which has implications for the digest() function in the digest
>> package.  Note, it does give the same output across platforms.  I know
>> that serialize() is under development, but is this expected, e.g. is
>> there some kind of header in the result that specifies "who" generated
>> the stream, and if so, exactly what bytes are they?
>>
>> SETUP:
>>
>> R versions:
>> A) R v2.4.0 (2006-10-03)
>> B) R v2.4.1pat (2007-01-13 r40470)
>> C) R v2.5.0dev (2006-12-12 r40167)
>>
>> This is on WinXP and I start R with Rterm --vanilla.
>>
>> Example: Identical serialize() calls using the different R versions.
>>
>>> raw <- serialize(1, connection=NULL, ascii=TRUE)
>>> print(raw)
>>
>> gives:
>>
>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
>> 0a 31 0a 31 0a
>>
>> Note the difference in raw bytes 8 to 10, i.e.
>>
>>> raw[7:11]
>> (A): [1] 32 30 39 36 0a
>> (B): [1] 32 30 39 37 0a
>> (C): [1] 32 33 35 32 0a
>>
>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
>> about the R version or similar?  The following poor man's test says
>> that is the only difference:
>>
>> On all R versions, the following gives identical results:
>>
>>> raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
>>> raw <- as.integer(raw[-c(8:10)])
>>> sum(raw)
>> [1] 2147884
>>> sum(log(raw))
>> [1] 177201.2
>>
>> If it is true that there is an R version specific header in serialized
>> objects, then the digest() function should exclude such a header in
>> order to produce consistent results across R versions, because now
>> digest(1) gives different results.
>>
>> Thank you
>>
>> Henrik
>>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu