[Bioc-devel] SummarizedExperiments not equal after serialisation

Pages, Herve hp@ge@ @end|ng |rom |redhutch@org
Thu May 16 08:19:00 CEST 2019


Let's try to get to the bottom of this. But let's leave SummarizedExperiment objects out of the picture for now and focus on what happens with a very simple reference object.

When you create 2 instances of a reference class with the same content:

  A <- setRefClass("A", fields=c(stuff="ANY"))
  a0 <- A(stuff=letters)
  a1 <- A(stuff=letters)


the .xData slot (which is an environment) is "different" between the 2 instances in the sense that the 2 environments live at different addresses in memory:

  a0@.xData                        # <environment: 0x3812150>
  a1@.xData                        # <environment: 0x381c7e0>
  identical(a0@.xData, a1@.xData)  # FALSE


However their **content** is the same:

  all.equal(a0@.xData, a1@.xData)  # TRUE


and the 2 objects are considered equal:

  all.equal(a0, a1)                # TRUE


When the **content** of the 2 objects differs, all.equal() sees 2 environments with different contents:

  b <- A(stuff=LETTERS)
  isTRUE(all.equal(a0@.xData, b@.xData)) # FALSE

and no longer considers the 2 objects equal:

  all.equal(a0, b)                 # "Component “stuff”: 26 string mismatches"


So far so good.
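
The same distinction can be seen with plain environments, independently of reference classes. A minimal sketch, not taken from the session above (the names e1, e2 and e3 are made up for illustration):

```r
# Two distinct environments holding the same content.
e1 <- new.env()
e2 <- new.env()
assign("stuff", letters, envir = e1)
assign("stuff", letters, envir = e2)

# identical() compares environments by reference (i.e. by address).
same_address <- identical(e1, e2)            # FALSE

# all.equal() compares the objects stored in the environments.
same_content <- isTRUE(all.equal(e1, e2))    # TRUE

# A third environment with different content is reported as different.
e3 <- new.env()
assign("stuff", LETTERS, envir = e3)
differs <- !isTRUE(all.equal(e1, e3))        # TRUE
```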

When an object goes thru a serialization/deserialization cycle:

  saveRDS(a0, "a0.rds")
  a2 <- readRDS("a0.rds")


the .xData slot of the restored object also lives at a different address:

  a2@.xData                        # <environment: 0x3944668>
  identical(a0@.xData, a2@.xData)  # FALSE


(This is what serialization/deserialization does on environments so is expected.)
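
This effect on environments can be reproduced without any reference class. A small sketch, assuming an in-memory round trip with serialize()/unserialize() stands in for saveRDS()/readRDS() (the names e and e_restored are illustrative only):

```r
e <- new.env()
assign("stuff", letters, envir = e)

# Round trip through an in-memory serialization.
e_restored <- unserialize(serialize(e, connection = NULL))

# A new environment was allocated, so the two are not the same object ...
same_address <- identical(e, e_restored)               # FALSE

# ... but the content survived the round trip.
same_content <- isTRUE(all.equal(e, e_restored))       # TRUE
same_stuff   <- identical(e$stuff, e_restored$stuff)   # TRUE
```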

So in that respect 'a2' is no different from 'a1'. However for 'a2' now we have:

  all.equal(a0, a2)                # "Class definitions are not identical"


So why is 'all.equal(a0, a2)' doing this? This cannot be explained only by the fact that 'a0@.xData' and 'a2@.xData' are non-identical environments.

Looking at the source code for all.equal.envRefClass(), we see something like this (slightly simplified here):

  ...
  if (!identical(target$getClass(), current$getClass())) {
      ...
      return(sprintf("Class definitions are not identical%s", ...))
  }
  ...


So let's try this:

  identical(a0$getClass(), a1$getClass())  # TRUE
  identical(a0$getClass(), a2$getClass())  # FALSE

Note that 'x$getClass()' is not the same as 'class(x)'. The latter returns the **class name** while the former returns the **class definition** (which is represented by a complicated object of class refClassRepresentation).

'a0' and 'a2' have identical class names:

  class(a0)
  # [1] "A"
  # attr(,"package")
  # [1] ".GlobalEnv"

  class(a2)
  # [1] "A"
  # attr(,"package")
  # [1] ".GlobalEnv"

  identical(class(a0), class(a2))
  # [1] TRUE


So now the question is: even though 'a0' and 'a2' have identical **class names**, how come they do NOT have identical **class definitions**?

The big surprise (at least to me) is that reference objects, unlike traditional S4 objects, CARRY THEIR OWN COPY OF THE CLASS DEFINITION! This copy is kept in the '.refClassDef' variable inside the .xData environment of the object:

  ls(a0@.xData, all=TRUE)
  # [1] ".refClassDef" ".self"        "getClass"     "stuff"

  ls(a2@.xData, all=TRUE)
  # [1] ".refClassDef" ".self"        "getClass"     "stuff"

This private copy of the class definition is actually what 'x$getClass()' returns:

  identical(a0$getClass(), get(".refClassDef", envir=a0@.xData))  # TRUE
  identical(a2$getClass(), get(".refClassDef", envir=a2@.xData))  # TRUE


The problem is that for 'a2' this copy of the class definition is not identical to the **original** class definition:

  identical(getClass("A"), a0$getClass())  # TRUE
  identical(getClass("A"), a2$getClass())  # FALSE


And this in turn is because the complicated object that represents the class definition also contains environments (e.g. 'getClass("A")@refMethods' is an environment) so going thru a serialization/deserialization cycle is not a **strict no-op** on it (from an identical() perspective).
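
This can be checked on the class definition alone, without going thru an object. A sketch, assuming an in-memory serialize()/unserialize() round trip behaves like saveRDS()/readRDS() here (class_def and restored_def are illustrative names):

```r
library(methods)

A <- setRefClass("A", fields = c(stuff = "ANY"))
class_def <- getClass("A")

# The 'refMethods' slot of the class definition is an environment.
has_env_slot <- is.environment(class_def@refMethods)        # TRUE

# Round trip of the class definition alone.
restored_def <- unserialize(serialize(class_def, connection = NULL))

# The restored slot is a new environment, so identical() fails on the slot
# and therefore on the class definition as a whole.
same_slot <- identical(class_def@refMethods, restored_def@refMethods)  # FALSE
same_def  <- identical(class_def, restored_def)                        # FALSE
```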

Replacing the copy of the class definition stored in 'a2' with the original class definition makes the problem go away:

  rm(".refClassDef", envir=a2@.xData)
  assign(".refClassDef", getClass("A"), envir=a2@.xData)
  all.equal(a0, a2)  # TRUE


Bottom line: the test 'identical(target$getClass(), current$getClass())' performed by all.equal.envRefClass() seems too stringent. It should probably be replaced with something a little bit more tolerant i.e. something that considers environments that live at different addresses but have the same content to be equal. Looks like 'isTRUE(all.equal(target$getClass(), current$getClass()))' could do the job.

Finally note that, in addition to the above test, all.equal.envRefClass() also does this test (slightly simplified here):

  if (!isTRUE(all.equal(class(target), class(current))))
      return(sprintf("Classes differ: %s", ...))


Maybe that's all it needs to do to compare the classes of the 2 objects? (Ironically this test uses all.equal() when it could use identical().)
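
To make that last idea concrete, here is a sketch of a class-name-only check. The helper same_class_name() is hypothetical (it is not part of R or of all.equal.envRefClass()), and the in-memory round trip is assumed to be equivalent to saveRDS()/readRDS():

```r
library(methods)

A <- setRefClass("A", fields = c(stuff = "ANY"))
a0 <- A(stuff = letters)

# In-memory serialization/deserialization cycle.
a2 <- unserialize(serialize(a0, connection = NULL))

# Hypothetical helper: compare the class names only, not the class definitions.
same_class_name <- function(target, current)
    identical(class(target), class(current))

names_match <- same_class_name(a0, a2)          # TRUE
stuff_match <- identical(a0$stuff, a2$stuff)    # TRUE
```

Whether comparing class names alone is strict enough for all.equal.envRefClass() is exactly the open question here.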

Michael?

H.


On 5/11/19 15:09, Aaron Lun wrote:
I would say it's much worse than mismatching class definitions.

https://github.com/Bioconductor/SummarizedExperiment/issues/16
-A

On 5/11/19 5:07 AM, Martin Morgan wrote:
I think it has to do with the use of reference classes in the assay slot, which have different environments

   se = SummarizedExperiment()
   saveRDS(se, fl <- tempfile())
   se1 = readRDS(fl)

and then

   > all.equal(se@assays, se1@assays)
   [1] "Class definitions are not identical"
   > all.equal(se@assays@.xData, se1@assays@.xData)
   [1] "Component \".self\": Class definitions are not identical"
   > se@assays@.xData
   <environment: 0x7fb1de1ede90>
   > se1@assays@.xData
   <environment: 0x7fb1fc2bca78>

Martin

On 5/11/19, 6:38 AM, "Bioc-devel on behalf of Laurent Gatto" <bioc-devel-bounces using r-project.org on behalf of laurent.gatto using uclouvain.be> wrote:

     I would appreciate some background about the following:

     > suppressPackageStartupMessages(library("SummarizedExperiment"))
     > set.seed(1L)
     > m <- matrix(rnorm(16), ncol = 4, dimnames = list(letters[1:4], LETTERS[1:4]))
     > rowdata <- DataFrame(X = 1:4, row.names = letters[1:4])
     > se1 <- SummarizedExperiment(m, rowData = rowdata)
     > se2 <- SummarizedExperiment(m, rowData = rowdata)
     > all.equal(se1, se2)
     [1] TRUE

     But after serialising and reading se2, the two instances aren't equal any more:

     > saveRDS(se2, file = "se2.rds")
     > rm(se2)
     > se2 <- readRDS("se2.rds")
     > all.equal(se1, se2)
     [1] "Attributes: < Component “assays”: Class definitions are not identical >"

     Session information provided below.

     Thank you in advance,

     Laurent

     R version 3.6.0 RC (2019-04-21 r76417)
     Platform: x86_64-pc-linux-gnu (64-bit)
     Running under: Ubuntu 18.04.2 LTS

     Matrix products: default
     BLAS:   /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
     LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

     locale:
      [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
      [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8
      [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8
      [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C
      [9] LC_ADDRESS=C               LC_TELEPHONE=C
     [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

     attached base packages:
     [1] parallel  stats4    stats     graphics  grDevices utils     datasets
     [8] methods   base

     other attached packages:
      [1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
      [3] BiocParallel_1.18.0         matrixStats_0.54.0
      [5] Biobase_2.44.0              GenomicRanges_1.36.0
      [7] GenomeInfoDb_1.20.0         IRanges_2.18.0
      [9] S4Vectors_0.22.0            BiocGenerics_0.30.0

     loaded via a namespace (and not attached):
      [1] lattice_0.20-38        bitops_1.0-6           grid_3.6.0
      [4] zlibbioc_1.30.0        XVector_0.24.0         Matrix_1.2-17
      [7] tools_3.6.0            RCurl_1.95-4.12        compiler_3.6.0
     [10] GenomeInfoDbData_1.2.1

     _______________________________________________
     Bioc-devel using r-project.org mailing list
     https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

