[Bioc-devel] SummarizedExperiments not equal after serialisation

Kasper Daniel Hansen k@@perd@n|e|h@n@en @end|ng |rom gm@||@com
Thu May 16 14:57:22 CEST 2019


Interesting detective work. This is nasty.

Best,
Kasper

On Thu, May 16, 2019 at 2:19 AM Pages, Herve <hpages using fredhutch.org> wrote:

> Let's try to get to the bottom of this. But let's leave
> SummarizedExperiment objects out of the picture for now and focus on what
> happens with a very simple reference object.
>
> When you create 2 instances of a reference class with the same content:
>
>   A <- setRefClass("A", fields=c(stuff="ANY"))
>   a0 <- A(stuff=letters)
>   a1 <- A(stuff=letters)
>
>
> the .xData slot (which is an environment) is "different" between the 2
> instances in the sense that the 2 environments live at different addresses
> in memory:
>
>   a0@.xData                        # <environment: 0x3812150>
>   a1@.xData                        # <environment: 0x381c7e0>
>   identical(a0@.xData, a1@.xData)  # FALSE
>
>
> However their **content** is the same:
>
>   all.equal(a0@.xData, a1@.xData)  # TRUE
>
>
> and the 2 objects are considered equal:
>
>   all.equal(a0, a1)                # TRUE
>
>
> When the **content** of the 2 objects differs, all.equal() sees 2
> environments with different contents:
>
>   b <- A(stuff=LETTERS)
>   isTRUE(all.equal(a0@.xData, b@.xData))  # FALSE
>
> and no longer considers the 2 objects equal:
>
>   all.equal(a0, b)                 # "Component “stuff”: 26 string mismatches"
>
>
> So far so good.
>
> When an object goes thru a serialization/deserialization cycle:
>
>   saveRDS(a0, "a0.rds")
>   a2 <- readRDS("a0.rds")
>
>
> the .xData slot of the restored object also lives at a different address:
>
>   a2@.xData                        # <environment: 0x3944668>
>   identical(a0@.xData, a2@.xData)  # FALSE
>
>
> (This is what serialization/deserialization does on environments so is
> expected.)
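
As a minimal sketch of that point, using a plain environment instead of a reference object (the names 'e', 'e2' and 'f' are made up for illustration): the restored environment is a new object, while all.equal() looks at the variables stored inside it.

```r
# A plain environment holding one variable; emptyenv() as parent keeps
# the example independent of the calling environment.
e <- new.env(parent = emptyenv())
assign("stuff", letters, envir = e)

# Round trip through serialization.
f <- tempfile(fileext = ".rds")
saveRDS(e, f)
e2 <- readRDS(f)
unlink(f)

same_address <- identical(e, e2)          # compares environment identity
same_content <- isTRUE(all.equal(e, e2))  # compares the variables inside
print(c(same_address = same_address, same_content = same_content))
```
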
>
> So in that respect 'a2' is no different from 'a1'. However for 'a2' now we
> have:
>
>   all.equal(a0, a2)                # "Class definitions are not identical"
>
>
> So why is 'all.equal(a0, a2)' doing this? This cannot be explained only by
> the fact that 'a0@.xData' and 'a2@.xData' are non-identical environments.
>
> Looking at the source code for all.equal.envRefClass(), we see something
> like this (slightly simplified here):
>
>   ...
>   if (!identical(target$getClass(), current$getClass())) {
>       ...
>       return(sprintf("Class definitions are not identical%s", ...))
>   }
>   ...
>
>
> So let's try this:
>
>   identical(a0$getClass(), a1$getClass())  # TRUE
>   identical(a0$getClass(), a2$getClass())  # FALSE
>
> Note that 'x$getClass()' is not the same as 'class(x)'. The latter returns
> the **class name** while the former returns the **class definition** (which
> is represented by a complicated object of class refClassRepresentation).
>
> 'a0' and 'a2' have identical class names:
>
>   class(a0)
>   # [1] "A"
>   # attr(,"package")
>   # [1] ".GlobalEnv"
>
>   class(a2)
>   # [1] "A"
>   # attr(,"package")
>   # [1] ".GlobalEnv"
>
>   identical(class(a0), class(a2))
>   # [1] TRUE
>
>
> So now the question is: even though 'a0' and 'a2' have identical **class
> names**, how come they do NOT have identical **class definitions**?
>
> The big surprise (at least to me) is that reference objects, unlike
> traditional S4 objects, CARRY THEIR OWN COPY OF THE CLASS DEFINITION! This
> copy is kept in the '.refClassDef' variable of the object's .xData
> environment:
>
>   ls(a0@.xData, all.names=TRUE)
>   # [1] ".refClassDef" ".self"        "getClass"     "stuff"
>
>   ls(a2@.xData, all.names=TRUE)
>   # [1] ".refClassDef" ".self"        "getClass"     "stuff"
>
> This private copy of the class definition is actually what 'x$getClass()'
> returns:
>
>   identical(a0$getClass(), get(".refClassDef", envir=a0@.xData))  # TRUE
>   identical(a2$getClass(), get(".refClassDef", envir=a2@.xData))  # TRUE
>
>
> Problem is that for 'a2' this copy of the class definition is not
> identical to the **original class** definition:
>
>   identical(getClass("A"), a0$getClass())  # TRUE
>   identical(getClass("A"), a2$getClass())  # FALSE
>
>
> And this in turn is because the complicated object that represents the
> class definition also contains environments (e.g.
> 'getClass("A")@refMethods' is an environment) so going thru a
> serialization/deserialization cycle is not a **strict no-op** on it (from
> an identical() perspective).
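
A small sketch to see which parts of the class definition are environments. It assumes the definition is an instance of refClassRepresentation, as stated above; 'def_slots' and 'env_slots' are names made up for illustration.

```r
library(methods)

A <- setRefClass("A", fields = c(stuff = "ANY"))
def <- getClass("A")

# Slots of the class used to represent reference class definitions.
# The class name is passed explicitly because slotNames(def) would
# list the slots of class "A" itself.
def_slots <- slotNames("refClassRepresentation")

# Keep the slots whose value in 'def' is an environment.
env_slots <- Filter(function(s) is.environment(slot(def, s)), def_slots)
print(env_slots)
```

Any slot listed here gets a fresh address when the definition is deserialized, which is enough to make identical() return FALSE.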
>
> Replacing the copy of the class definition stored in 'a2' with the
> original class definition makes the problem go away:
>
>   rm(".refClassDef", envir=a2@.xData)
>   assign(".refClassDef", getClass("A"), envir=a2@.xData)
>   all.equal(a0, a2)  # TRUE
>
>
> Bottom line: the test 'identical(target$getClass(), current$getClass())'
> performed by all.equal.envRefClass() seems too stringent. It should
> probably be replaced with something a little bit more tolerant i.e.
> something that considers environments that live at different addresses but
> have the same content to be equal. Looks like
> 'isTRUE(all.equal(target$getClass(), current$getClass()))' could do the job.
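
To illustrate what a more tolerant comparison could look like, here is a rough sketch that compares class names and field values only and ignores the private '.refClassDef' copy. 'rc_fields_equal' is a hypothetical helper written for this example; it is not part of the methods package and not necessarily how all.equal.envRefClass() should be changed.

```r
library(methods)

# Hypothetical helper: compare two reference objects by class name and
# field values, without looking at their embedded class definitions.
rc_fields_equal <- function(target, current) {
    if (!identical(class(target), class(current)))
        return("Classes differ")
    # Field names are taken from the class definition of the current session.
    field_names <- names(getRefClass(class(target))$fields())
    for (nm in field_names) {
        res <- all.equal(target$field(nm), current$field(nm))
        if (!isTRUE(res))
            return(paste0("Field '", nm, "': ", res))
    }
    TRUE
}

A <- setRefClass("A", fields = c(stuff = "ANY"))
a0 <- A(stuff = letters)
b  <- A(stuff = LETTERS)

# Same serialization round trip as above, via a temporary file.
f <- tempfile(fileext = ".rds")
saveRDS(a0, f)
a2 <- readRDS(f)
unlink(f)

print(rc_fields_equal(a0, a2))  # restored copy of a0
print(rc_fields_equal(a0, b))   # object with different content
```
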
>
> Finally note that, in addition to the above test, all.equal.envRefClass()
> also does this test (slightly simplified here):
>
>   if (!isTRUE(all.equal(class(target), class(current))))
>       return(sprintf("Classes differ: %s", ...))
>
>
> Maybe that's all it needs to do to compare the classes of the 2
> objects? (Ironically this test uses all.equal() when it could use
> identical().)
>
> Michael?
>
> H.
>
>
> On 5/11/19 15:09, Aaron Lun wrote:
> I would say it's much worse than mismatching class definitions.
>
>
> https://github.com/Bioconductor/SummarizedExperiment/issues/16
>
> -A
>
> On 5/11/19 5:07 AM, Martin Morgan wrote:
> I think it has to do with the use of reference classes in the assay slot,
> which have different environments
>
>    se = SummarizedExperiment()
>    saveRDS(se, fl <- tempfile())
>    se1 = readRDS(fl)
>
> and then
>
> all.equal(se@assays, se1@assays)
> [1] "Class definitions are not identical"
> all.equal(se@assays@.xData, se1@assays@.xData)
> [1] "Component \".self\": Class definitions are not identical"
> se@assays@.xData
> <environment: 0x7fb1de1ede90>
> se1@assays@.xData
> <environment: 0x7fb1fc2bca78>
>
> Martin
>
> On 5/11/19, 6:38 AM, "Bioc-devel on behalf of Laurent Gatto"
> <bioc-devel-bounces using r-project.org on behalf of
> laurent.gatto using uclouvain.be> wrote:
>
>      I would appreciate some background about the following:
>
>      > suppressPackageStartupMessages(library("SummarizedExperiment"))
>      > set.seed(1L)
>      > m <- matrix(rnorm(16), ncol = 4, dimnames = list(letters[1:4], LETTERS[1:4]))
>      > rowdata <- DataFrame(X = 1:4, row.names = letters[1:4])
>      > se1 <- SummarizedExperiment(m, rowData = rowdata)
>      > se2 <- SummarizedExperiment(m, rowData = rowdata)
>      > all.equal(se1, se2)
>      [1] TRUE
>
>      But after serialising and reading se2, the two instances aren't
>      equal any more:
>
>      > saveRDS(se2, file = "se2.rds")
>      > rm(se2)
>      > se2 <- readRDS("se2.rds")
>      > all.equal(se1, se2)
>      [1] "Attributes: < Component “assays”: Class definitions are not identical >"
>
>      Session information provided below.
>
>      Thank you in advance,
>
>      Laurent
>
>      R version 3.6.0 RC (2019-04-21 r76417)
>      Platform: x86_64-pc-linux-gnu (64-bit)
>      Running under: Ubuntu 18.04.2 LTS
>
>      Matrix products: default
>      BLAS:   /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
>      LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
>
>      locale:
>       [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>       [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8
>       [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8
>       [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C
>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>      [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
>
>      attached base packages:
>      [1] parallel  stats4    stats     graphics  grDevices utils     datasets
>      [8] methods   base
>
>      other attached packages:
>       [1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
>       [3] BiocParallel_1.18.0         matrixStats_0.54.0
>       [5] Biobase_2.44.0              GenomicRanges_1.36.0
>       [7] GenomeInfoDb_1.20.0         IRanges_2.18.0
>       [9] S4Vectors_0.22.0            BiocGenerics_0.30.0
>
>      loaded via a namespace (and not attached):
>       [1] lattice_0.20-38        bitops_1.0-6           grid_3.6.0
>       [4] zlibbioc_1.30.0        XVector_0.24.0         Matrix_1.2-17
>       [7] tools_3.6.0            RCurl_1.95-4.12        compiler_3.6.0
>      [10] GenomeInfoDbData_1.2.1
>
>      _______________________________________________
>      Bioc-devel using r-project.org mailing list
>      https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages using fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319