[Rd] Small inconsistency in serialize() between R versions and implications on digest()

Thu Mar 8 05:11:00 CET 2007

To follow up, I went ahead and generated "random" objects to scan for a
common header for a given R version.  It seems that at most the first
18 bytes are independent of the data, which could be the length of the
serialization header.

Here is my code for this:

scanSerialize <- function(object, hdr=NULL, ...) {
  # Serialize object
  raw <- serialize(object, connection=NULL, ascii=TRUE);

  # First run?
  if (is.null(hdr))
    return(raw);

  # Find differences between current longest header and new raw vector
  n <- length(hdr);
  diffs <- (as.integer(hdr) != as.integer(raw[seq_len(n)]));

  # No differences?
  if (!any(diffs))
    return(hdr);

  # Position of first difference
  idx <- which(diffs)[1];

  # Keep common header
  hdr <- hdr[seq_len(idx-1)];

  hdr;
};

# Serialize a first "random" object
hdr <- scanSerialize(NA);
for (kk in 1:100)
  hdr <- scanSerialize(kk, hdr=hdr);
for (kk in 1:100) {
  # Draw a single random length between 1 and 100
  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
  hdr <- scanSerialize(x, hdr=hdr);
}
for (kk in 1:100) {
  hdr <- scanSerialize(kk, hdr=hdr);
  hdr <- scanSerialize(hdr, hdr=hdr);
}

cat("Length:", length(hdr), "\n");
print(hdr);
print(rawToChar(hdr));

On R v2.5.0 devel, this gives:
Length: 18
 [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
[1] "A\n2\n132352\n131840\n"
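
The two long numbers in this header look like packed R version numbers.
As a sketch, assuming each one is encoded as major*65536 + minor*256 +
patch (an assumption that fits every value seen in this thread, not an
official statement), they can be decoded like this.  decodeVersion() is
a hypothetical helper name, not part of R or digest:

```r
# Header text as printed above (R v2.5.0 devel)
hdrText <- "A\n2\n132352\n131840\n"

# Split into the newline-terminated fields: "A" "2" "132352" "131840"
fields <- strsplit(hdrText, "\n", fixed=TRUE)[[1]]

# ASSUMPTION: a version is packed as major*65536 + minor*256 + patch
decodeVersion <- function(v) {
  v <- as.integer(v)
  paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep=".")
}

print(decodeVersion(fields[3:4]))
# [1] "2.5.0" "2.3.0"

# Third field as seen with the two R v2.4.x builds in the quoted message
print(decodeVersion(c("132096", "132097")))
# [1] "2.4.0" "2.4.1"
```

If that reading is right, the third field changes with the R version
that wrote the stream, which would explain the differing bytes, while
the fourth field is the same in all three outputs.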

However, it would still be good to get an "official" statement from
someone on the R-core team about the serialization header and where the
data section starts.  Again, I want to cut out as much as possible for
consistency between R versions without losing data-dependent bytes.
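
Until then, one possible way to cut the header without hard-coding 18
bytes is to drop everything up to a given newline.  This is only a
sketch: it assumes the ASCII header consists of exactly four
newline-terminated fields, as in the output above, and it pins
version=2 (an argument of serialize() in current R) because other
format versions may use a different header.  stripHeader() and nfields
are hypothetical names:

```r
stripHeader <- function(object, nfields=4L) {
  # ASCII serialization; format version 2 is pinned (assumption, see above)
  bytes <- serialize(object, connection=NULL, ascii=TRUE, version=2);

  # Positions of all newline bytes (0x0a)
  nl <- which(bytes == as.raw(0x0a));
  if (length(nl) < nfields)
    stop("Fewer newline-terminated fields than expected");

  # Drop everything up to and including the nfields:th newline
  bytes[-seq_len(nl[nfields])];
};

full <- serialize(1, connection=NULL, ascii=TRUE, version=2);
body <- stripHeader(1);
cat("Header bytes removed:", length(full) - length(body), "\n");
```

The remaining bytes could then be passed to a hash function in place of
the full serialization.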

Thanks

/Henrik

On 3/7/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> Hi,
>
> I noticed that serialize() gives different results depending on R
> version, which has implications for the digest() function in the digest
> package.  Note that it does give the same output across platforms.  I
> know that serialize() is under development, but is this expected?  For
> example, is there some kind of header in the result that specifies
> "who" generated the stream, and if so, exactly which bytes are they?
>
> SETUP:
>
> R versions:
> A) R v2.4.0 (2006-10-03)
> B) R v2.4.1pat (2007-01-13 r40470)
> C) R v2.5.0dev (2006-12-12 r40167)
>
> This is on WinXP and I start R with Rterm --vanilla.
>
> Example: Identical serialize() calls using the different R versions.
>
> > raw <- serialize(1, connection=NULL, ascii=TRUE)
> > print(raw)
>
> gives:
>
> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
> 0a 31 0a 31 0a
> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
> 0a 31 0a 31 0a
> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
> 0a 31 0a 31 0a
>
> Note the difference in raw bytes 8 to 10, i.e.
>
> > raw[7:11]
> (A): [1] 32 30 39 36 0a
> (B): [1] 32 30 39 37 0a
> (C): [1] 32 33 35 32 0a
>
> Do bytes 8, 9 and 10 in the raw vector somehow contain information
> about the R version or similar?  The following poor man's test suggests
> that this is the only difference:
>
> On all R versions, the following gives identical results:
>
> > raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
> > raw <- as.integer(raw[-c(8:10)])
> > sum(raw)
> [1] 2147884
> > sum(log(raw))
> [1] 177201.2
>
> If it is true that there is an R-version-specific header in serialized
> objects, then the digest() function should exclude such a header in
> order to produce consistent results across R versions, because
> currently digest(1) gives different results.
>
> Thank you
>
> Henrik
>