[Rd] Clearing attributes returns ALTREP, serialize still saves them
Gabriel Becker
gabembecker using gmail.com
Sat Jul 3 08:22:16 CEST 2021
Ok, a bit more:
The relevant bit in serialize.c that I can see is:
    if (ALTREP(s) && stream->version >= 3) {
        SEXP info = ALTREP_SERIALIZED_CLASS(s);
        SEXP state = ALTREP_SERIALIZED_STATE(s);
        if (info != NULL && state != NULL) {
            int flags = PackFlags(ALTREP_SXP, LEVELS(s), OBJECT(s), 0, 0);
            PROTECT(state);
            PROTECT(info);
            OutInteger(stream, flags);
            WriteItem(info, ref_table, stream);
            WriteItem(state, ref_table, stream);  /* <-- highlighted line */
            WriteItem(ATTRIB(s), ref_table, stream);
            UNPROTECT(2); /* state, info */
            return;
        }
        /* else fall through to standard processing */
    }
And in the wrapper altclass, we have:
    static SEXP wrapper_Serialized_state(SEXP x)
    {
        return CONS(WRAPPER_WRAPPED(x), WRAPPER_METADATA(x));
    }
So what's happening is that the data isn't being written out during the
WriteItem(ATTRIB(s), ...) call, which actually has the correct attribute
value. It's being written out by the WriteItem(state, ...) call just above
it (the highlighted line). The state holds the wrapped SEXP, which ITSELF
has the attributes on it but is not an ALTREP, so it goes through standard
processing, which writes out its attributes as normal.
So that, I believe, is what needs to change. One possibility is to make
wrapper_Serialized_state smarter, so that the inner attributes are
duplicated and any that are overridden by the attributes on the wrapper are
then wiped from the copy. Another option is to make the ALTREP section of
WriteItem smarter, but that seems less robust.
Finally, the wrapper might be modified so that setting an attribute on the
wrapper clears that attribute on the wrapped value, if present.
I think making wrapper_Serialized_state smarter is the right way to attack
this, and that's the first thing I'll try when I get to it, but if someone
tackles it before me, hopefully this digging helped some.
Best,
~G
On Fri, Jul 2, 2021 at 10:18 PM Gabriel Becker <gabembecker using gmail.com>
wrote:
> Hi all,
>
> I don't have a solution yet, but a bit more here:
>
> > .Internal(inspect(x2b))
> @7f913826d590 14 REALSXP g0c0 [REF(1)] wrapper [srt=-2147483648,no_na=0]
>   @7f9137500320 14 REALSXP g0c7 [REF(2),ATT] (len=100, tl=0) 0.45384,0.926371,0.838637,-1.71485,-0.719073,...
>   ATTRIB:
>     @7f913826dc20 02 LISTSXP g0c0 [REF(1)]
>       TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(460)] "data"
>       @7f9118310000 14 REALSXP g0c7 [REF(2)] (len=1000000, tl=0) 0.66682,0.480576,-1.13229,0.453313,-0.819498,...
>
> > attr(x2b, "data") <- "small"
> > .Internal(inspect(x2b))
> @7f913826d590 14 REALSXP g0c0 [REF(1),ATT] wrapper [srt=-2147483648,no_na=0]
>   @7f9137500320 14 REALSXP g0c7 [REF(2),ATT] (len=100, tl=0) 0.45384,0.926371,0.838637,-1.71485,-0.719073,...
>   ATTRIB:
>     @7f913826dc20 02 LISTSXP g0c0 [REF(1)]
>       TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(461)] "data"
>       @7f9118310000 14 REALSXP g0c7 [REF(2)] (len=1000000, tl=0) 0.66682,0.480576,-1.13229,0.453313,-0.819498,...
> ATTRIB:
>   @7f913826c870 02 LISTSXP g0c0 [REF(1)]
>     TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(461)] "data"
>     @7f9120580850 16 STRSXP g0c1 [REF(3)] (len=1, tl=0)
>       @7f91205808c0 09 CHARSXP g0c1 [REF(3),gp=0x60] [ASCII] [cached] "small"
>
> So we can see that the assignment of attr(x2b, "data") IS doing something,
> but it isn't doing the right thing. The fact that the above code assigned
> null instead of a value was hiding this.
>
>
> I will dig into this more if someone doesn't get it fixed before me, but
> it won't be until after useR, because I'm preparing multiple talks for that
> and it is this coming week.
>
>
> Best,
>
> ~G
>
> On Fri, Jul 2, 2021 at 9:15 PM Zafer Barutcuoglu <
> zafer.barutcuoglu using gmail.com> wrote:
>
>> Hi all,
>>
>> Setting names/dimnames on vectors/matrices of length>=64 returns an
>> ALTREP wrapper which internally still contains the names/dimnames, and
>> calling base::serialize on the result writes them out. They are
>> unserialized in the same way, with the names/dimnames hidden in the ALTREP
>> wrapper, so the problem is not obvious except in wasted time, bandwidth, or
>> disk space.
>>
>> Example:
>> v1 <- setNames(rnorm(64), paste("element name", 1:64))
>> v2 <- unname(v1)
>> names(v2)
>> # NULL
>> length(serialize(v1, NULL))
>> # [1] 2039
>> length(serialize(v2, NULL))
>> # [1] 2132
>> length(serialize(v2[TRUE], NULL))
>> # [1] 543
>>
>> con <- rawConnection(raw(), "w")
>> serialize(v2, con)
>> v3 <- unserialize(rawConnectionValue(con))
>> names(v3)
>> # NULL
>> length(serialize(v3, NULL))
>> # [1] 2132
>>
>> # Similarly for matrices:
>> m1 <- matrix(rnorm(64), 8, 8, dimnames=list(paste("row name", 1:8),
>> paste("col name", 1:8)))
>> m2 <- unname(m1)
>> dimnames(m2)
>> # NULL
>> length(serialize(m1, NULL))
>> # [1] 918
>> length(serialize(m2, NULL))
>> # [1] 1035
>> length(serialize(m2[TRUE, TRUE], NULL))
>> # [1] 582
>>
>> Previously discussed here, too:
>> https://r.789695.n4.nabble.com/Invisible-names-problem-td4764688.html
>>
>> This happens with other attributes as well, but less predictably:
>> x1 <- structure(rnorm(100), data=rnorm(1000000))
>> x2 <- structure(x1, data=NULL)
>> length(serialize(x1, NULL))
>> # [1] 8000952
>> length(serialize(x2, NULL))
>> # [1] 924
>>
>> x1b <- rnorm(100)
>> attr(x1b, "data") <- rnorm(1000000)
>> x2b <- x1b
>> attr(x2b, "data") <- NULL
>> length(serialize(x1b, NULL))
>> # [1] 8000863
>> length(serialize(x2b, NULL))
>> # [1] 8000956
>>
>> This is pretty severe. It means trying to track down why serializing a
>> small object kills the network, only to find that large attributes it
>> once had at some point in its lifetime around the codebase are still
>> secretly tagging along.
>>
>> Is there a plan to resolve this? Any suggestions for maybe a C++
>> workaround until then? Or an alternative performant serialization solution?
>>
>> Best,
>> --
>> Zafer
>>
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>