[Rd] Small inconsistency in serialize() between R versions and implications on digest()

Luke Tierney luke at stat.uiowa.edu
Thu Mar 8 20:43:36 CET 2007


On Fri, 9 Mar 2007, Paul Murrell wrote:

> Hi
>
>
> Luke Tierney wrote:
>> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
>>
>>> To follow up, I went ahead and generated "random" objects to scan for a
>>> common header for a given R version, and it seems that at most the
>>> first 18 bytes are not data-specific, which could be the length of
>>> the serialization header.
>>>
>>> Here is my code for this:
>>>
>>> scanSerialize <- function(object, hdr=NULL, ...) {
>>>  # Serialize object
>>>  raw <- serialize(object, connection=NULL, ascii=TRUE);
>>>
>>>  # First run?
>>>  if (is.null(hdr))
>>>    return(raw);
>>>
>>>  # Find differences between current longest header and new raw vector
>>>  n <- length(hdr);
>>>  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
>>>
>>>  # No differences?
>>>  if (!any(diffs))
>>>    return(hdr);
>>>
>>>  # Position of first difference
>>>  idx <- which(diffs)[1];
>>>
>>>  # Keep common header
>>>  hdr <- hdr[seq_len(idx-1)];
>>>
>>>  hdr;
>>> };
>>>
>>> # Serialize a first "random" object
>>> hdr <- scanSerialize(NA);
>>> for (kk in 1:100)
>>>  hdr <- scanSerialize(kk, hdr=hdr);
>>> for (kk in 1:100) {
>>>  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
>>>  hdr <- scanSerialize(x, hdr=hdr);
>>> }
>>> for (kk in 1:100) {
>>>  hdr <- scanSerialize(kk, hdr=hdr);
>>>  hdr <- scanSerialize(hdr, hdr=hdr);
>>> }
>>>
>>> cat("Length:", length(hdr), "\n");
>>> print(hdr);
>>> print(rawToChar(hdr));
>>>
>>> On R v2.5.0 devel, this gives:
>>> Length: 18
>>> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
>>> [1] "A\n2\n132352\n131840\n"
>>>
>>> However, it would still be good to get an "official" statement from
>>> one in the R-code team about the serialization header and where the
>>> data section starts.  Again, I want to cut out as much as possible for
>>> consistency between R versions without losing data-dependent bytes.
>>
>> An official, and definitive, statement from the _R-core_ team has been
>> available to you all along at
>>
>>  	https://svn.r-project.org/R/trunk/src/main/serialize.c
>
>
> There's also a bit of info on this in Section 1.7 of the "R Internals"
> Manual.
>
> Paul

Thanks -- I'd forgotten about that.  Looking at that shows that my
unofficial and non-definitive interpretation was not quite right for
the binary case -- the header there is 14 bytes (I forgot that there
is a \n after the X even in the binary case).
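
As a rough sketch of the above (not taken from serialize.c): assuming
the packed version integers are encoded as major * 65536 + minor * 256
+ patch, which matches the values in this thread (132097 for R 2.4.1,
131840 for R 2.3.0), the header could be split off along these lines.
The helper names unpackVersion and splitHeader are made up for
illustration, and the version = 2 argument to serialize() is assumed to
be available (it is in current R releases); newer format versions carry
extra header fields that this sketch does not handle.

```r
# Decode a packed R version integer into major/minor/patch.
# Assumption: packing is major * 65536 + minor * 256 + patch.
unpackVersion <- function(v) {
  v <- as.integer(v)
  c(major = v %/% 65536L,
    minor = (v %/% 256L) %% 256L,
    patch = v %% 256L)
}

# Split a format-2 serialization into header and data parts.
# (Hypothetical helper; only handles the 'A' and 'X' format codes.)
splitHeader <- function(raw) {
  fmt <- rawToChar(raw[1])
  n <- switch(fmt,
    # ascii: the header is the first four lines, so cut at the 4th "\n"
    A = which(raw == as.raw(10L))[4],
    # xdr: "X\n" plus three 4-byte integers = 14 bytes
    X = 14L,
    stop("unsupported format code: ", fmt))
  list(header = raw[seq_len(n)], data = raw[-seq_len(n)])
}

# Usage: look at the ascii header, and decode a value from this thread
h <- splitHeader(serialize(1, NULL, ascii = TRUE, version = 2))
cat(rawToChar(h$header))
unpackVersion(132097)   # major 2, minor 4, patch 1
```

Cutting at the fourth newline rather than at a fixed byte 18 keeps the
ascii case working if the packed version integer ever changes its
number of digits.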

Best,

luke

>
>
>> My unofficial and non-definitive interpretation of that statement is
>> that there is a header of four items,
>>
>>      A format code 'A' or 'X' ('B' also possible in older formats)
>>      version number of the format
>>      Packed integer containing the R version that did the serializing
>>      Packed integer containing the oldest R version that can read the format
>>
>> You can see this if you look at the ascii version as text:
>>
>>     > serialize(1, stdout(), ascii=TRUE)
>>      A
>>      2
>>      132097
>>      131840
>>      14
>>      1
>>      1
>>      NULL
>>     > serialize(as.integer(1), stdout(), ascii=TRUE)
>>      A
>>      2
>>      132097
>>      131840
>>      13
>>      1
>>      1
>>      NULL
>>
>> In the non-ascii 'X' (as in xdr) format this will constitute 13 bytes.
>> In ascii format I believe it is currently 18 bytes but this could
>> change with the version number of R -- I'd have to read the official
>> and definitive statement to see how the integer packing is done and
>> work out whether that could change the number of bytes. The number of
>> bytes would also change if we reached format version 10, but something
>> about the format would also change of course.  A safer way to look at
>> the header in the ascii version is as the first four lines.
>>
>> Best,
>>
>> luke
>>
>>> Thanks
>>>
>>> /Henrik
>>>
>>> On 3/7/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
>>>> Hi,
>>>>
>>>> I noticed that serialize() gives different results depending on R
>>>> version, which has implications for the digest() function in the digest
>>>> package.  Note, it does give the same output across platforms.  I know
>>>> that serialize() is under development, but is this expected, e.g. is
>>>> there some kind of header in the result that specifies "who" generated
>>>> the stream, and if so, exactly what bytes are they?
>>>>
>>>> SETUP:
>>>>
>>>> R versions:
>>>> A) R v2.4.0 (2006-10-03)
>>>> B) R v2.4.1pat (2007-01-13 r40470)
>>>> C) R v2.5.0dev (2006-12-12 r40167)
>>>>
>>>> This is on WinXP and I start R with Rterm --vanilla.
>>>>
>>>> Example: Identical serialize() calls using the different R versions.
>>>>
>>>>> raw <- serialize(1, connection=NULL, ascii=TRUE)
>>>>> print(raw)
>>>> gives:
>>>>
>>>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
>>>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
>>>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
>>>>
>>>> Note the difference in raw bytes 8 to 10, i.e.
>>>>
>>>>> raw[7:11]
>>>> (A): [1] 32 30 39 36 0a
>>>> (B): [1] 32 30 39 37 0a
>>>> (C): [1] 32 33 35 32 0a
>>>>
>>>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
>>>> about the R version or similar?  The following poor man's test says
>>>> that is the only difference:
>>>>
>>>> On all R versions, the following gives identical results:
>>>>
>>>>> raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
>>>>> raw <- as.integer(raw[-c(8:10)])
>>>>> sum(raw)
>>>> [1] 2147884
>>>>> sum(log(raw))
>>>> [1] 177201.2
>>>>
>>>> If it is true that there is an R-version-specific header in serialized
>>>> objects, then the digest() function should exclude that header in
>>>> order to produce consistent results across R versions, because now
>>>> digest(1) gives different results.
>>>>
>>>> Thank you
>>>>
>>>> Henrik
>>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


