[Bioc-devel] SummarizedExperiments not equal after serialisation
Kasper Daniel Hansen
k@@perd@n|e|h@n@en @end|ng |rom gm@||@com
Thu May 16 14:57:22 CEST 2019
Interesting detective work. This is nasty.
Best,
Kasper
On Thu, May 16, 2019 at 2:19 AM Pages, Herve <hpages using fredhutch.org> wrote:
> Let's try to get to the bottom of this. But let's leave
> SummarizedExperiment objects out of the picture for now and focus on what
> happens with a very simple reference object.
>
> When you create 2 instances of a reference class with the same content:
>
> A <- setRefClass("A", fields=c(stuff="ANY"))
> a0 <- A(stuff=letters)
> a1 <- A(stuff=letters)
>
>
> the .xData slot (which is an environment) is "different" between the 2
> instances in the sense that the 2 environments live at different addresses
> in memory:
>
> a0@.xData   # <environment: 0x3812150>
> a1@.xData   # <environment: 0x381c7e0>
> identical(a0@.xData, a1@.xData)   # FALSE
>
>
> However their **content** is the same:
>
> all.equal(a0@.xData, a1@.xData)   # TRUE
>
>
> and the 2 objects are considered equal:
>
> all.equal(a0, a1) # TRUE
>
>
> When the **content** of the 2 objects differs, all.equal() sees 2
> environments with different contents:
>
> b <- A(stuff=LETTERS)
> isTRUE(all.equal(a0@.xData, b@.xData))   # FALSE
>
> and no longer considers the 2 objects equal:
>
> all.equal(a0, b)   # "Component “stuff”: 26 string mismatches"
>
>
> So far so good.
>
> When an object goes thru a serialization/deserialization cycle:
>
> saveRDS(a0, "a0.rds")
> a2 <- readRDS("a0.rds")
>
>
> the .xData slot of the restored object also lives at a different address:
>
> a2@.xData   # <environment: 0x3944668>
> identical(a0@.xData, a2@.xData)   # FALSE
>
>
> (This is what serialization/deserialization does on environments so is
> expected.)
>
> So in that aspect 'a2' is no different from 'a1'. However for 'a2' now we
> have:
>
> all.equal(a0, a2) # "Class definitions are not identical"
>
>
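
A minimal sketch of the round trip described above, gathered into one runnable script. The temporary file path and the variable names same_class_name, same_content and same_env are assumptions made for illustration (the post writes to "a0.rds"). The last two results are printed rather than annotated, since they may depend on the R version in use.

```r
# Define a simple reference class and create one instance, as in the post.
A <- setRefClass("A", fields = c(stuff = "ANY"))
a0 <- A(stuff = letters)

# Serialize and deserialize the object via a temporary file
# (hypothetical path; the post uses "a0.rds").
rds <- tempfile(fileext = ".rds")
saveRDS(a0, rds)
a2 <- readRDS(rds)

# The class name and the field content survive the round trip.
same_class_name <- identical(class(a0), class(a2))
same_content <- identical(a0$stuff, a2$stuff)

# The environments backing the two objects are distinct.
same_env <- identical(a0@.xData, a2@.xData)

print(c(same_class_name = same_class_name,
        same_content = same_content,
        same_env = same_env))

# Comparison of the class definitions carried by each object, and the
# overall all.equal() result (printed only, no expected value assumed).
print(identical(a0$getClass(), a2$getClass()))
print(all.equal(a0, a2))
```
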
> So why is 'all.equal(a0, a2)' doing this? This cannot be explained only by
> the fact that 'a0@.xData' and 'a2@.xData' are non-identical environments.
>
> Looking at the source code for all.equal.envRefClass(), we see something
> like this (slightly simplified here):
>
> ...
> if (!identical(target$getClass(), current$getClass())) {
>     ...
>     return(sprintf("Class definitions are not identical%s", ...))
> }
> ...
>
>
> So let's try this:
>
> identical(a0$getClass(), a1$getClass()) # TRUE
> identical(a0$getClass(), a2$getClass()) # FALSE
>
> Note that 'x$getClass()' is not the same as 'class(x)'. The latter returns
> the **class name** while the former returns the **class definition** (which
> is represented by a complicated object of class refClassRepresentation).
>
> 'a0' and 'a2' have identical class names:
>
> class(a0)
> # [1] "A"
> # attr(,"package")
> # [1] ".GlobalEnv"
>
> class(a2)
> # [1] "A"
> # attr(,"package")
> # [1] ".GlobalEnv"
>
> identical(class(a0), class(a2))
> # [1] TRUE
>
>
> So now the question is: even though 'a0' and 'a2' have identical **class
> names**, how come they do NOT have identical **class definitions**?
>
> The big surprise (at least to me) is that reference objects, unlike
> traditional S4 objects, CARRY THEIR OWN COPY OF THE CLASS DEFINITION! This
> copy is stored in the '.refClassDef' variable stored in the .xData
> environment of the object:
>
> ls(a0@.xData, all=TRUE)
> # [1] ".refClassDef" ".self" "getClass" "stuff"
>
> ls(a2@.xData, all=TRUE)
> # [1] ".refClassDef" ".self" "getClass" "stuff"
>
> This private copy of the class definition is actually what 'x$getClass()'
> returns:
>
> identical(a0$getClass(), get(".refClassDef", envir=a0@.xData)) # TRUE
> identical(a2$getClass(), get(".refClassDef", envir=a2@.xData)) # TRUE
>
>
> Problem is that for 'a2' this copy of the class definition is not
> identical to the **original class** definition:
>
> identical(getClass("A"), a0$getClass()) # TRUE
> identical(getClass("A"), a2$getClass()) # FALSE
>
>
> And this in turn is because the complicated object that represents the
> class definition also contains environments (e.g.
> 'getClass("A")@refMethods' is an environment) so going thru a
> serialization/deserialization cycle is not a **strict no-op** on it (from
> an identical() perspective).
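
The effect of a round trip on an environment can be seen in isolation with a small sketch. The names e, e2 and rds are made up for this example and do not come from the thread.

```r
# A plain environment holding one variable.
e <- new.env()
assign("x", 1:3, envir = e)

# Round trip through saveRDS()/readRDS() using a temporary file.
rds <- tempfile(fileext = ".rds")
saveRDS(e, rds)
e2 <- readRDS(rds)

# identical() compares environments by reference, so the restored
# environment is a different object...
print(identical(e, e2))   # FALSE

# ...while all.equal() compares the contents of the two environments.
print(all.equal(e, e2))   # TRUE
```

Any object that holds an environment somewhere in its structure inherits this behaviour under identical(), which is the situation described above for the class definition.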
>
> Replacing the copy of the class definition stored in 'a2' with the
> original class definition makes the problem go away:
>
> rm(".refClassDef", envir=a2@.xData)
> assign(".refClassDef", getClass("A"), envir=a2@.xData)
> all.equal(a0, a2) # TRUE
>
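
As an alternative to patching the restored object, user code could compare two reference objects by class name and field content only. The following is a rough sketch under that assumption; ref_equal() is a hypothetical helper, not a function of the methods package, and it deliberately ignores the class definition copy stored in each object.

```r
# Hypothetical helper: TRUE if 'x' and 'y' have the same class name and
# all their fields are all.equal(), ignoring the stored class definition.
ref_equal <- function(x, y) {
    if (!identical(class(x), class(y)))
        return(FALSE)
    # Field names are taken from the generator object of x's class.
    field_names <- names(x$getRefClass()$fields())
    for (f in field_names) {
        if (!isTRUE(all.equal(x$field(f), y$field(f))))
            return(FALSE)
    }
    TRUE
}

# Same setup as in the post, with a temporary file for the round trip.
A <- setRefClass("A", fields = c(stuff = "ANY"))
a0 <- A(stuff = letters)
b <- A(stuff = LETTERS)

rds <- tempfile(fileext = ".rds")
saveRDS(a0, rds)
a2 <- readRDS(rds)

print(ref_equal(a0, a2))   # TRUE: same class name, same field content
print(ref_equal(a0, b))    # FALSE: the 'stuff' fields differ
```

This only covers fields declared on the class; it says nothing about whether the methods attached to the two objects agree.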
>
> Bottom line: the test 'identical(target$getClass(), current$getClass())'
> performed by all.equal.envRefClass() seems too stringent. It should
> probably be replaced with something a little bit more tolerant i.e.
> something that considers environments that live at different addresses but
> have the same content to be equal. Looks like
> 'isTRUE(all.equal(target$getClass(), current$getClass()))' could do the job.
>
> Finally note that, in addition to the above test, all.equal.envRefClass()
> also does this test (slightly simplified here):
>
> if (!isTRUE(all.equal(class(target), class(current))))
> return(sprintf("Classes differ: %s", ...))
>
>
> Maybe that's all it needs to do to compare the classes of the 2
> objects? (Ironically this test uses all.equal() when it could use
> identical().)
>
> Michael?
>
> H.
>
>
> On 5/11/19 15:09, Aaron Lun wrote:
> I would say it's much worse than mismatching class definitions.
>
>
> https://github.com/Bioconductor/SummarizedExperiment/issues/16
>
> -A
>
> On 5/11/19 5:07 AM, Martin Morgan wrote:
> I think it has to do with the use of reference classes in the assay slot,
> which have different environments
>
> se = SummarizedExperiment()
> saveRDS(se, fl <- tempfile())
> se1 = readRDS(fl)
>
> and then
>
> all.equal(se@assays, se1@assays)
> [1] "Class definitions are not identical"
> all.equal(se@assays@.xData, se1@assays@.xData)
> [1] "Component \".self\": Class definitions are not identical"
> se@assays@.xData
> <environment: 0x7fb1de1ede90>
> se1@assays@.xData
> <environment: 0x7fb1fc2bca78>
>
> Martin
>
> On 5/11/19, 6:38 AM, "Bioc-devel on behalf of Laurent Gatto" <
> bioc-devel-bounces using r-project.org on behalf of laurent.gatto using uclouvain.be
> > wrote:
>
> I would appreciate some background about the following:
> > suppressPackageStartupMessages(library("SummarizedExperiment"))
> > set.seed(1L)
> > m <- matrix(rnorm(16), ncol = 4, dimnames = list(letters[1:4], LETTERS[1:4]))
> > rowdata <- DataFrame(X = 1:4, row.names = letters[1:4])
> > se1 <- SummarizedExperiment(m, rowData = rowdata)
> > se2 <- SummarizedExperiment(m, rowData = rowdata)
> > all.equal(se1, se2)
> [1] TRUE
> But after serialising and reading se2, the two instances aren't
> equal any more:
> > saveRDS(se2, file = "se2.rds")
> > rm(se2)
> > se2 <- readRDS("se2.rds")
> > all.equal(se1, se2)
> [1] "Attributes: < Component “assays”: Class definitions are not identical >"
> Session information provided below.
> Thank you in advance,
> Laurent
> R version 3.6.0 RC (2019-04-21 r76417)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.2 LTS
> Matrix products: default
> BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
> LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] parallel stats4 stats graphics grDevices utils datasets
> [8] methods base
> other attached packages:
> [1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
> [3] BiocParallel_1.18.0 matrixStats_0.54.0
> [5] Biobase_2.44.0 GenomicRanges_1.36.0
> [7] GenomeInfoDb_1.20.0 IRanges_2.18.0
> [9] S4Vectors_0.22.0 BiocGenerics_0.30.0
> loaded via a namespace (and not attached):
> [1] lattice_0.20-38 bitops_1.0-6 grid_3.6.0
> [4] zlibbioc_1.30.0 XVector_0.24.0 Matrix_1.2-17
> [7] tools_3.6.0 RCurl_1.95-4.12 compiler_3.6.0
> [10] GenomeInfoDbData_1.2.1
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages using fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>