[Rd] Small inconsistency in serialize() between R versions and implications on digest()
Henrik Bengtsson
hb at stat.berkeley.edu
Thu Mar 8 05:11:00 CET 2007
To follow up, I went ahead and generated "random" objects to scan for a
common header for a given R version. It seems that at most the first
18 bytes are not data specific, so 18 bytes could be the length of the
serialization header.
Here is my code for this:
scanSerialize <- function(object, hdr=NULL, ...) {
  # Serialize object
  raw <- serialize(object, connection=NULL, ascii=TRUE);

  # First run?
  if (is.null(hdr))
    return(raw);

  # Find differences between current longest header and new raw vector
  n <- length(hdr);
  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));

  # No differences?
  if (!any(diffs))
    return(hdr);

  # Position of first difference
  idx <- which(diffs)[1];

  # Keep common header
  hdr <- hdr[seq_len(idx-1)];

  hdr;
};
# Serialize a first "random" object
hdr <- scanSerialize(NA);

# Integers
for (kk in 1:100)
  hdr <- scanSerialize(kk, hdr=hdr);

# Character vectors of random length (1 to 100)
for (kk in 1:100) {
  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
  hdr <- scanSerialize(x, hdr=hdr);
}

# Integers again, and the current header itself (a raw vector)
for (kk in 1:100) {
  hdr <- scanSerialize(kk, hdr=hdr);
  hdr <- scanSerialize(hdr, hdr=hdr);
}

cat("Length:", length(hdr), "\n");
print(hdr);
print(rawToChar(hdr));
On R v2.5.0 devel, this gives:
Length: 18
[1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
[1] "A\n2\n132352\n131840\n"
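One possible reading of these 18 bytes is sketched below. It assumes that
the header consists of four newline-terminated fields and that the last two
are version numbers packed as major*65536 + minor*256 + patch. Both the
field names and the helper decodeVersion() are hypothetical and only
illustrate that assumption; nothing in this thread confirms the layout.

```r
# Hypothetical helper (not part of R): unpack an integer that is ASSUMED
# to be encoded as major*65536 + minor*256 + patch.
decodeVersion <- function(v) {
  v <- as.integer(v);
  paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep=".");
}

# The 18-byte common header found above, split on newlines
hdrText <- "A\n2\n132352\n131840\n";
fields <- strsplit(hdrText, "\n", fixed=TRUE)[[1]];

# Assumed (unconfirmed) meaning of the four fields
names(fields) <- c("format", "version", "writer", "minReader");
print(fields);

# The third field as seen under R versions (A), (B) and (C) in the
# original report, followed by the constant fourth field
codes <- c(132096, 132097, 132352, 131840);
print(sapply(codes, decodeVersion));
```

Under that assumption the third field decodes to 2.4.0, 2.4.1 and 2.5.0,
which matches the three R versions tested, and the fourth field decodes to
2.3.0.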
However, it would still be good to get an "official" statement from
someone on the R core team about the serialization header and where the
data section starts. Again, I want to cut out as much as possible for
consistency between R versions without losing data-dependent bytes.
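A minimal sketch of such a cut, assuming the ASCII header is exactly four
newline-terminated fields (as the 18 bytes above suggest, but not
confirmed). The helper stripHeader() and its argument nfields are
hypothetical. Cutting after the fourth newline instead of at a fixed byte
offset avoids depending on the number of digits in each field. The input
bytes are the three serialize(1, ...) results from the original report.

```r
# Hypothetical helper: drop everything up to and including the
# 'nfields'-th newline (0x0a) of an ASCII serialization.
stripHeader <- function(raw, nfields=4L) {
  nl <- which(raw == as.raw(0x0a));
  if (length(nl) < nfields)
    stop("Fewer newlines than expected header fields");
  raw[-seq_len(nl[nfields])];
}

# Convert a string of space-separated hex bytes to a raw vector
hexToRaw <- function(s) {
  as.raw(strtoi(strsplit(s, " ", fixed=TRUE)[[1]], base=16L));
}

# serialize(1, connection=NULL, ascii=TRUE) as reported for (A), (B), (C)
rawA <- hexToRaw("41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a");
rawB <- hexToRaw("41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a");
rawC <- hexToRaw("41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a");

# The raw vectors differ, but the stripped bodies should not
print(identical(rawA, rawC));
bodies <- lapply(list(rawA, rawB, rawC), stripHeader);
print(identical(bodies[[1]], bodies[[2]]) && identical(bodies[[2]], bodies[[3]]));
print(rawToChar(bodies[[1]]));
```

A digest computed on the stripped body rather than on the full raw vector
would then not depend on the R version, provided the assumption about the
header layout holds.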
Thanks
/Henrik
On 3/7/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> Hi,
>
> I noticed that serialize() gives different results depending on R
> version, which has implications for the digest() function in the digest
> package. Note, it does give the same output across platforms. I know
> that serialize() is under development, but is this expected, e.g. is
> there some kind of header in the result that specifies "who" generated
> the stream, and if so, exactly what bytes are they?
>
> SETUP:
>
> R versions:
> A) R v2.4.0 (2006-10-03)
> B) R v2.4.1pat (2007-01-13 r40470)
> C) R v2.5.0dev (2006-12-12 r40167)
>
> This is on WinXP and I start R with Rterm --vanilla.
>
> Example: Identical serialize() calls using the different R versions.
>
> > raw <- serialize(1, connection=NULL, ascii=TRUE)
> > print(raw)
>
> gives:
>
> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34
> 0a 31 0a 31 0a
> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34
> 0a 31 0a 31 0a
> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34
> 0a 31 0a 31 0a
>
> Note the difference in raw bytes 8 to 10, i.e.
>
> > raw[7:11]
> (A): [1] 32 30 39 36 0a
> (B): [1] 32 30 39 37 0a
> (C): [1] 32 33 35 32 0a
>
> Do bytes 8, 9 and 10 in the raw vector somehow contain information
> about the R version or similar? The following poor man's test suggests
> that this is the only difference:
>
> On all R versions, the following gives identical results:
>
> > raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
> > raw <- as.integer(raw[-c(8:10)])
> > sum(raw)
> [1] 2147884
> > sum(log(raw))
> [1] 177201.2
>
> If it is true that there is an R-version-specific header in serialized
> objects, then the digest() function should exclude that header in
> order to produce consistent results across R versions, because
> currently digest(1) gives different results.
>
> Thank you
>
> Henrik
>