[Rd] Small inconsistency in serialize() between R versions and implications on digest()

Henrik Bengtsson hb at stat.berkeley.edu
Thu Mar 8 21:57:29 CET 2007


On 3/8/07, Luke Tierney <luke at stat.uiowa.edu> wrote:
> On Fri, 9 Mar 2007, Paul Murrell wrote:
>
> > Hi
> >
> >
> > Luke Tierney wrote:
> >> On Wed, 7 Mar 2007, Henrik Bengtsson wrote:
> >>
> >>> To follow up, I went ahead and generated "random" objects to scan for a
> >>> common header for a given R version, and it seems that at most
> >>> the first 18 bytes are non-data specific, which could be the length of
> >>> the serialization header.
> >>>
> >>> Here is my code for this:
> >>>
> >>> scanSerialize <- function(object, hdr=NULL, ...) {
> >>>  # Serialize object
> >>>  raw <- serialize(object, connection=NULL, ascii=TRUE);
> >>>
> >>>  # First run?
> >>>  if (is.null(hdr))
> >>>    return(raw);
> >>>
> >>>  # Find differences between current longest header and new raw vector
> >>>  n <- length(hdr);
> >>>  diffs <- (as.integer(hdr) != as.integer(raw[1:n]));
> >>>
> >>>  # No differences?
> >>>  if (!any(diffs))
> >>>    return(hdr);
> >>>
> >>>  # Position of first difference
> >>>  idx <- which(diffs)[1];
> >>>
> >>>  # Keep common header
> >>>  hdr <- hdr[seq_len(idx-1)];
> >>>
> >>>  hdr;
> >>> };
> >>>
> >>> # Serialize a first "random" object
> >>> hdr <- scanSerialize(NA);
> >>> for (kk in 1:100)
> >>>  hdr <- scanSerialize(kk, hdr=hdr);
> >>> for (kk in 1:100) {
> >>>  x <- sample(letters, size=sample(100, size=1), replace=TRUE);
> >>>  hdr <- scanSerialize(x, hdr=hdr);
> >>> }
> >>> for (kk in 1:100) {
> >>>  hdr <- scanSerialize(kk, hdr=hdr);
> >>>  hdr <- scanSerialize(hdr, hdr=hdr);
> >>> }
> >>>
> >>> cat("Length:", length(hdr), "\n");
> >>> print(hdr);
> >>> print(rawToChar(hdr));
> >>>
> >>> On R v2.5.0 devel, this gives:
> >>> Length: 18
> >>> [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a
> >>> [1] "A\n2\n132352\n131840\n"
> >>>
> >>> However, it would still be good to get an "official" statement from
> >>> someone in the R-core team about the serialization header and where the
> >>> data section starts.  Again, I want to cut out as much as possible for
> >>> consistency between R versions without losing data-dependent bytes.
> >>
> >> An official, and definitive, statement from the _R-core_ team has been
> >> available to you all along at
> >>
> >>      https://svn.r-project.org/R/trunk/src/main/serialize.c
> >
> >
> > There's also a bit of info on this in Section 1.7 of the "R Internals"
> > Manual.
> >
> > Paul
>
> Thanks -- I'd forgotten about that.  Looking at that shows that my
> unofficial and non-definitive interpretation was not quite right for
> the binary case -- the header there is 14 bytes (I forgot that there
> is a \n after the X even in the binary case).
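
To make the 14-byte count concrete, here is a minimal sketch of reading
the binary header back. It assumes the layout described above (the two
bytes "X\n" followed by three 4-byte big-endian integers) and pins
version = 2, the format discussed in this thread; newer formats may
carry extra header fields. The variable names are illustrative only.

```r
# Serialize in the binary XDR format, pinned to format version 2
raw <- serialize(1, connection = NULL, ascii = FALSE, xdr = TRUE, version = 2)

# Assumed layout: 2 bytes of format code plus 3 x 4-byte integers = 14 bytes
hdr <- raw[1:14]

# Format code, including the trailing newline
fmt <- rawToChar(hdr[1:2])

# Format version, writer R version, and minimal reader R version
ints <- readBin(hdr[3:14], what = "integer", n = 3L, size = 4L, endian = "big")

print(fmt)
print(ints)
```

The second integer depends on the R version doing the serializing, so
it is the part that differs between installations.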

Luke and Paul, thank you for this.  Searching for the 4th newline
seems to be the most robust thing to do in the ASCII case.
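
A minimal sketch of that approach, assuming the four-line ASCII header
described in this thread. stripAsciiHeader() is a hypothetical helper
name, and version = 2 is passed so that the format matches the one
discussed here (later formats may add header lines):

```r
# Hypothetical helper: drop the header of an ASCII serialization by
# cutting at the 4th newline byte.
stripAsciiHeader <- function(raw) {
  # Positions of all newline bytes (0x0a) in the raw vector
  nl <- which(raw == as.raw(0x0a))
  stopifnot(length(nl) >= 4L)
  # Keep only what follows the 4th newline, i.e. the data section
  raw[-seq_len(nl[4L])]
}

raw <- serialize(1, connection = NULL, ascii = TRUE, version = 2)
body <- stripAsciiHeader(raw)

# Number of header bytes that were removed
nHeader <- length(raw) - length(body)
cat("Header bytes removed:", nHeader, "\n")
cat(rawToChar(body))
```

Counting newlines rather than bytes means the cut does not depend on
how many digits the packed version integers happen to have.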

/Henrik

>
> Best,
>
> luke
>
> >
> >
> >> My unofficial and non-definitive interpretation of that statement is
> >> that there is a header of four items,
> >>
> >>      A format code 'A' or 'X' ('B' also possible in older formats)
> >>      version number of the format
> >>      Packed integer containing the R version that did the serializing
> >>      Packed integer containing the oldest R version that can read the format
> >>
> >> You can see this if you look at the ascii version as text:
> >>
> >>     > serialize(1, stdout(), ascii=TRUE)
> >>      A
> >>      2
> >>      132097
> >>      131840
> >>      14
> >>      1
> >>      1
> >>      NULL
> >>     > serialize(as.integer(1), stdout(), ascii=TRUE)
> >>      A
> >>      2
> >>      132097
> >>      131840
> >>      13
> >>      1
> >>      1
> >>      NULL
> >>
> >> In the non-ascii 'X' (as in xdr) format this will constitute 13 bytes.
> >> In ascii format I believe it is currently 18 bytes but this could
> >> change with the version number of R -- I'd have to read the official
> >> and definitive statement to see how the integer packing is done and
> >> work out whether that could change the number of bytes. The number of
> >> bytes would also change if we reached format version 10, but something
> >> about the format would also change of course.  A safer way to look at
> >> the header in the ascii version is as the first four lines.
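
For reference, the packed integers in the examples above decode
cleanly if one assumes the packing major*65536 + minor*256 + patch.
The sketch below is illustrative only; unpackRVersion() is a
hypothetical helper, not part of R.

```r
# Hypothetical helper: unpack integers assumed to be packed as
# major * 65536 + minor * 256 + patch
unpackRVersion <- function(packed) {
  packed <- as.integer(packed)
  major <- packed %/% 65536L
  minor <- (packed %% 65536L) %/% 256L
  patch <- packed %% 256L
  paste(major, minor, patch, sep = ".")
}

# The four values that appear in this thread
versions <- unpackRVersion(c(132096, 132097, 132352, 131840))
print(versions)
# [1] "2.4.0" "2.4.1" "2.5.0" "2.3.0"
```

Under that assumption, 132096, 132097 and 132352 are the writer
versions 2.4.0, 2.4.1 and 2.5.0, and 131840 is 2.3.0, the oldest
version able to read the format.
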
> >>
> >> Best,
> >>
> >> luke
> >>
> >>> Thanks
> >>>
> >>> /Henrik
> >>>
> >>> On 3/7/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> >>>> Hi,
> >>>>
> >>>> I noticed that serialize() gives different results depending on R
> >>>> version, which has implications for the digest() function in the digest
> >>>> package.  Note, it does give the same output across platforms.  I know
> >>>> that serialize() is under development, but is this expected, e.g. is
> >>>> there some kind of header in the result that specifies "who" generated
> >>>> the stream, and if so, exactly what bytes are they?
> >>>>
> >>>> SETUP:
> >>>>
> >>>> R versions:
> >>>> A) R v2.4.0 (2006-10-03)
> >>>> B) R v2.4.1pat (2007-01-13 r40470)
> >>>> C) R v2.5.0dev (2006-12-12 r40167)
> >>>>
> >>>> This is on WinXP and I start R with Rterm --vanilla.
> >>>>
> >>>> Example: Identical serialize() calls using the different R versions.
> >>>>
> >>>>> raw <- serialize(1, connection=NULL, ascii=TRUE)
> >>>>> print(raw)
> >>>> gives:
> >>>>
> >>>> (A): [1] 41 0a 32 0a 31 33 32 30 39 36 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
> >>>> (B): [1] 41 0a 32 0a 31 33 32 30 39 37 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
> >>>> (C): [1] 41 0a 32 0a 31 33 32 33 35 32 0a 31 33 31 38 34 30 0a 31 34 0a 31 0a 31 0a
> >>>>
> >>>> Note the difference in raw bytes 8 to 10, i.e.
> >>>>
> >>>>> raw[7:11]
> >>>> (A): [1] 32 30 39 36 0a
> >>>> (B): [1] 32 30 39 37 0a
> >>>> (C): [1] 32 33 35 32 0a
> >>>>
> >>>> Do bytes 8, 9 and 10 in the raw vector somehow contain information
> >>>> about the R version or similar?  The following poor man's test suggests
> >>>> that this is the only difference:
> >>>>
> >>>> On all R versions, the following gives identical results:
> >>>>
> >>>>> raw <- serialize(1:1e4, connection=NULL, ascii=TRUE)
> >>>>> raw <- as.integer(raw[-c(8:10)])
> >>>>> sum(raw)
> >>>> [1] 2147884
> >>>>> sum(log(raw))
> >>>> [1] 177201.2
> >>>>
> >>>> If it is true that there is an R-version-specific header in serialized
> >>>> objects, then the digest() function should exclude that header in
> >>>> order to produce consistent results across R versions, because currently
> >>>> digest(1) gives different results depending on the R version.
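
As a rough illustration of the idea, a checksum can be computed over
the data section only. This sketch does not use the digest package: it
relies on base R plus tools::md5sum(), assumes the four-line ASCII
header described elsewhere in this thread, and pins version = 2.
headerlessMd5() is a hypothetical helper name.

```r
# Hypothetical helper: MD5 of the serialized data section only,
# skipping the ASCII header (everything up to the 4th newline).
headerlessMd5 <- function(object) {
  raw <- serialize(object, connection = NULL, ascii = TRUE, version = 2)
  nl <- which(raw == as.raw(0x0a))
  stopifnot(length(nl) >= 4L)
  body <- raw[-seq_len(nl[4L])]

  # tools::md5sum() works on files, so write the bytes to a temp file
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(body, tmp)
  unname(tools::md5sum(tmp))
}

h1 <- headerlessMd5(1)
h2 <- headerlessMd5(2)
print(h1)
```

Because the writer version lives in the skipped header, the checksum
should depend on R versions only if the data section itself is
written differently.
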
> >>>>
> >>>> Thank you
> >>>>
> >>>> Henrik
> >>>>
> >>> ______________________________________________
> >>> R-devel at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >>
> >
> >
>
> --
> Luke Tierney
> Chair, Statistics and Actuarial Science
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>     Actuarial Science
> 241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>


