[Bioc-devel] SummarizedExperiments not equal after serialisation
Pages, Herve
hpages@fredhutch.org
Thu May 16 08:19:00 CEST 2019
Let's try to get to the bottom of this. But let's leave SummarizedExperiment objects out of the picture for now and focus on what happens with a very simple reference object.
When you create 2 instances of a reference class with the same content:
A <- setRefClass("A", fields=c(stuff="ANY"))
a0 <- A(stuff=letters)
a1 <- A(stuff=letters)
the .xData slot (which is an environment) is "different" between the 2 instances in the sense that the 2 environments live at different addresses in memory:
a0@.xData # <environment: 0x3812150>
a1@.xData # <environment: 0x381c7e0>
identical(a0@.xData, a1@.xData) # FALSE
However their **content** is the same:
all.equal(a0@.xData, a1@.xData) # TRUE
and the 2 objects are considered equal:
all.equal(a0, a1) # TRUE
When the **content** of the 2 objects differs, all.equal() sees 2 environments with different contents:
b <- A(stuff=LETTERS)
isTRUE(all.equal(a0@.xData, b@.xData)) # FALSE
and no longer considers the 2 objects equal:
all.equal(a0, b) # "Component “stuff”: 26 string mismatches"
So far so good.
When an object goes thru a serialization/deserialization cycle:
saveRDS(a0, "a0.rds")
a2 <- readRDS("a0.rds")
the .xData slot of the restored object also lives at a different address:
a2@.xData # <environment: 0x3944668>
identical(a0@.xData, a2@.xData) # FALSE
(This is what serialization/deserialization does on environments so is expected.)
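A minimal sketch of that behavior on a plain environment, assuming an in-memory serialize()/unserialize() round trip behaves like the saveRDS()/readRDS() cycle above (the names 'e0' and 'e1' are made up for illustration):

```r
# Plain environment with one variable, standing in for the .xData slot
e0 <- new.env()
assign("stuff", letters, envir = e0)

# In-memory round trip (assumed equivalent to saveRDS() + readRDS())
e1 <- unserialize(serialize(e0, NULL))

identical(e0, e1)                             # FALSE: a new environment was created
identical(get("stuff", envir = e1), letters)  # TRUE: the content survived
```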
So in that respect 'a2' is no different from 'a1'. However for 'a2' now we have:
all.equal(a0, a2) # "Class definitions are not identical"
So why is 'all.equal(a0, a2)' doing this? This cannot be explained only by the fact that 'a0@.xData' and 'a2@.xData' are non-identical environments.
Looking at the source code for all.equal.envRefClass(), we see something like this (slightly simplified here):
...
if (!identical(target$getClass(), current$getClass())) {
...
return(sprintf("Class definitions are not identical%s", ...))
}
...
So let's try this:
identical(a0$getClass(), a1$getClass()) # TRUE
identical(a0$getClass(), a2$getClass()) # FALSE
Note that 'x$getClass()' is not the same as 'class(x)'. The latter returns the **class name** while the former returns the **class definition** (which is represented by a complicated object of class refClassRepresentation).
'a0' and 'a2' have identical class names:
class(a0)
# [1] "A"
# attr(,"package")
# [1] ".GlobalEnv"
class(a2)
# [1] "A"
# attr(,"package")
# [1] ".GlobalEnv"
identical(class(a0), class(a2))
# [1] TRUE
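The name/definition distinction can be checked directly. This is only a sketch that re-creates the toy class 'A' from above in a fresh session:

```r
library(methods)

A <- setRefClass("A", fields = c(stuff = "ANY"))
a0 <- A(stuff = letters)

# class() gives the class *name*: a character vector
is.character(class(a0))                      # TRUE

# $getClass() gives the class *definition*: an S4 object
is(a0$getClass(), "refClassRepresentation")  # TRUE
isVirtualClass("A")                          # FALSE: 'A' can be instantiated
```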
So now the question is: even though 'a0' and 'a2' have identical **class names**, how come they do NOT have identical **class definitions**?
The big surprise (at least to me) is that reference objects, unlike traditional S4 objects, CARRY THEIR OWN COPY OF THE CLASS DEFINITION! This copy is kept in the '.refClassDef' variable in the .xData environment of the object:
ls(a0@.xData, all=TRUE)
# [1] ".refClassDef" ".self" "getClass" "stuff"
ls(a2@.xData, all=TRUE)
# [1] ".refClassDef" ".self" "getClass" "stuff"
This private copy of the class definition is actually what 'x$getClass()' returns:
identical(a0$getClass(), get(".refClassDef", envir=a0@.xData)) # TRUE
identical(a2$getClass(), get(".refClassDef", envir=a2@.xData)) # TRUE
Problem is that for 'a2' this copy of the class definition is not identical to the **original class** definition:
identical(getClass("A"), a0$getClass()) # TRUE
identical(getClass("A"), a2$getClass()) # FALSE
And this in turn is because the complicated object that represents the class definition also contains environments (e.g. 'getClass("A")@refMethods' is an environment) so going thru a serialization/deserialization cycle is not a **strict no-op** on it (from an identical() perspective).
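To isolate this effect, one can round-trip the class definition itself rather than an instance. A sketch, again assuming an in-memory serialize()/unserialize() cycle is a fair stand-in for saveRDS()/readRDS() ('def' and 'def2' are illustrative names):

```r
library(methods)

A <- setRefClass("A", fields = c(stuff = "ANY"))
def <- getClass("A")

# The class definition holds at least one environment
is.environment(def@refMethods)              # TRUE

# Round-trip the definition on its own
def2 <- unserialize(serialize(def, NULL))

# The restored environment lives at a new address ...
identical(def@refMethods, def2@refMethods)  # FALSE

# ... so the two definitions as a whole should no longer be identical() either
identical(def, def2)
```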
Replacing the copy of the class definition stored in 'a2' with the original class definition makes the problem go away:
rm(".refClassDef", envir=a2@.xData)
assign(".refClassDef", getClass("A"), envir=a2@.xData)
all.equal(a0, a2) # TRUE
Bottom line: the test 'identical(target$getClass(), current$getClass())' performed by all.equal.envRefClass() seems too stringent. It should probably be replaced with something a little bit more tolerant i.e. something that considers environments that live at different addresses but have the same content to be equal. Looks like 'isTRUE(all.equal(target$getClass(), current$getClass()))' could do the job.
Finally note that, in addition to the above test, all.equal.envRefClass() also does this test (slightly simplified here):
if (!isTRUE(all.equal(class(target), class(current))))
return(sprintf("Classes differ: %s", ...))
Maybe that's all it needs to do to compare the classes of the 2 objects? (Ironically this test uses all.equal() when it could use identical().)
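For illustration only, here is a rough sketch of what such a more tolerant comparison could look like. 'lenient_ref_equal' is a hypothetical helper, not part of R or Bioconductor and not a proposed patch: it compares class names with identical() and then the field values with all.equal(), ignoring the per-object copy of the class definition.

```r
library(methods)

# Hypothetical helper: compare 2 reference objects by class name and
# field content only, without looking at '.refClassDef'.
lenient_ref_equal <- function(target, current) {
    if (!identical(class(target), class(current)))
        return("Classes differ")
    field_names <- names(getRefClass(class(target))$fields())
    for (f in field_names) {
        res <- all.equal(target$field(f), current$field(f))
        if (!isTRUE(res))
            return(paste0("Field '", f, "': ", paste(res, collapse = "; ")))
    }
    TRUE
}

A <- setRefClass("A", fields = c(stuff = "ANY"))
a0 <- A(stuff = letters)
a2 <- unserialize(serialize(a0, NULL))  # in-memory stand-in for saveRDS()/readRDS()
b <- A(stuff = LETTERS)

lenient_ref_equal(a0, a2)  # TRUE
lenient_ref_equal(a0, b)   # a character string describing the mismatch
```

This deliberately skips everything else all.equal.envRefClass() may check, so it only shows the idea of comparing by content rather than by address.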
Michael?
H.
On 5/11/19 15:09, Aaron Lun wrote:
I would say it's much worse than mismatching class definitions.
https://github.com/Bioconductor/SummarizedExperiment/issues/16
-A
On 5/11/19 5:07 AM, Martin Morgan wrote:
I think it has to do with the use of reference classes in the assay slot, which have different environments
se = SummarizedExperiment()
saveRDS(se, fl <- tempfile())
se1 = readRDS(fl)
and then
all.equal(se@assays, se1@assays)
[1] "Class definitions are not identical"
all.equal(se@assays@.xData, se1@assays@.xData)
[1] "Component \".self\": Class definitions are not identical"
se@assays@.xData
<environment: 0x7fb1de1ede90>
se1@assays@.xData
<environment: 0x7fb1fc2bca78>
Martin
On 5/11/19, 6:38 AM, "Bioc-devel on behalf of Laurent Gatto" <bioc-devel-bounces@r-project.org on behalf of laurent.gatto@uclouvain.be> wrote:
I would appreciate some background about the following:
> suppressPackageStartupMessages(library("SummarizedExperiment"))
> set.seed(1L)
> m <- matrix(rnorm(16), ncol = 4, dimnames = list(letters[1:4], LETTERS[1:4]))
> rowdata <- DataFrame(X = 1:4, row.names = letters[1:4])
> se1 <- SummarizedExperiment(m, rowData = rowdata)
> se2 <- SummarizedExperiment(m, rowData = rowdata)
> all.equal(se1, se2)
[1] TRUE
But after serialising and reading se2, the two instances aren't equal any more:
> saveRDS(se2, file = "se2.rds")
> rm(se2)
> se2 <- readRDS("se2.rds")
> all.equal(se1, se2)
[1] "Attributes: < Component “assays”: Class definitions are not identical >"
Session information provided below.
Thank you in advance,
Laurent
R version 3.6.0 RC (2019-04-21 r76417)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
[3] BiocParallel_1.18.0 matrixStats_0.54.0
[5] Biobase_2.44.0 GenomicRanges_1.36.0
[7] GenomeInfoDb_1.20.0 IRanges_2.18.0
[9] S4Vectors_0.22.0 BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] lattice_0.20-38 bitops_1.0-6 grid_3.6.0
[4] zlibbioc_1.30.0 XVector_0.24.0 Matrix_1.2-17
[7] tools_3.6.0 RCurl_1.95-4.12 compiler_3.6.0
[10] GenomeInfoDbData_1.2.1
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319