[Rd] Objectsize function visiting every element for alt-rep strings

Tierney, Luke luke-tierney using uiowa.edu
Tue Jan 22 17:21:09 CET 2019


On Mon, 21 Jan 2019, Martin Maechler wrote:

>>>>>> Travers Ching
>>>>>>     on Tue, 15 Jan 2019 12:50:45 -0800 writes:
>
>    > I have a toy alt-rep string package that generates
>    > randomly seeded strings.  example:
>    > library(altstringisode)
>    > x <- altrandomStrings(1e8)
>    > head(x)
>    > [1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1"
>    > "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
>    > object.size(1e8)
>
>    > Object.size will call the set_altstring_Elt_method for
>    > every single element, materializing (slowly) every element
>    > of the vector.  This is a problem mostly in R-studio since
>    > object.size is called automatically, defeating the purpose
>    > of alt-rep.

There is no sensible way in general to figure out how large the
strings would be without computing them. There might be a way
specifically for a deferred sequence conversion, but figuring it out
would require a fair bit of effort that would be better spent elsewhere.

I've never been a big fan of object.size since what it is trying to
compute isn't very well defined in the context of sharing and possible
internal state changes (even before ALTREP, byte code compilation could
change the internals of a function [which object.size sees] and
assigning into environments or evaluating promises can change
environments [which object.size ignores]). The issue is not unlike the
one faced by identical(), which has a bunch of options for the
different ways objects can be identical, and might need even more.
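
To illustrate the sharing and environment points, here is a minimal
sketch in base R (the variable names are only for illustration; exact
byte counts depend on the platform, so none are shown):

```r
# A large numeric vector.
x <- runif(1e6)

# Both list elements refer to the same vector; no copy is made here.
shared <- list(x, x)

size_x    <- as.numeric(object.size(x))
size_list <- as.numeric(object.size(shared))

# object.size does not detect the sharing, so the list is reported
# as holding two full copies of x (plus the list's own overhead).
print(size_list >= 2 * size_x)

# Environments: object.size ignores their contents.
e <- new.env()
size_env_before <- as.numeric(object.size(e))
assign("big", x, envir = e)
size_env_after  <- as.numeric(object.size(e))

# The reported size is unchanged after storing a large object in e.
print(size_env_before == size_env_after)
```

So the list is reported at about twice the size of x although it holds
two references to one vector, and the environment's reported size does
not change after the assignment.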

We could in general have object.size for an ALTREP return the
object.size results of the current internal representation, but that
might not always be appropriate. Again, what object.size is trying to
compute isn't very well defined.

RStudio does seem to call object.size on every assignment to
.GlobalEnv. That might be worth revisiting.


Best,

luke

>
> Hmm.  But still, the idea had been that object.size()  *should*
> return the size of the "de-ALTREP'ed" object *but* should not
> de-ALTREP it.
> That's what happens for integers, but indeed fails to happen for
> such as.character(.)ed integers.
>
> From my eRum presentation (which took from the official ALTREP documentation
> https://svn.r-project.org/R/branches/ALTREP/ALTREP.html ) :
>
>  > x <- 1:1e15
>  > object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really
>  8000000000000048 bytes
>  > is.unsorted(x) # FALSE : i.e., R *knows* it is sorted
>  [1] FALSE
>  > xs <- sort(x)  #
>  > .Internal(inspect(x))
>  @80255f8 14 REALSXP g0c0 [NAM(7)]  1 : 1000000000000000 (compact)
>  >
>
>  > cx <- as.character(x)
>  > .Internal(inspect(cx))
>  @80485d8 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
>    @80255f8 14 REALSXP g1c0 [MARK,NAM(7)]  1 : 1000000000000000 (compact)
>  > system.time( print(object.size(x)), gc=FALSE)
>  8000000000000048 bytes
>     user  system elapsed
>    0.000   0.000   0.001
>  > system.time( print(object.size(cx)), gc=FALSE)
>  Error: cannot allocate vector of size 8388608.0 Gb
>  Timing stopped at: 11.43 0 11.46
>  >
>
> One could consider it a bug that object.size(cx) is indeed
> inspecting every string, i.e., accessing cx[i] for all i.
> Note that it is *not*  deALTREPing cx  itself :
>
>> x <- 1:1e6
>> cx <- as.character(x)
>> .Internal(inspect(cx))
>
> @7f5b1a0 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
>  @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
>> system.time( print(object.size(cx)), gc=FALSE)
> 64000048 bytes
>   user  system elapsed
>  0.369   0.005   0.374
>> .Internal(inspect(cx))
> @7f5b1a0 16 STRSXP g0c0 [NAM(7)]   <deferred string conversion>
>  @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
>>
>
>    > Is there a way to avoid the problem of forced
>    > materialization in rstudio?
>
>    > PS: Is there a way to tell if a post has been received by
>    > the mailing list?  How long does it take to show up in the
>    > archives?
>
> [ that (waiting time) distribution is quite right skewed... I'd
>  guess its median to be less than 10 minutes... but we had
>  artificially delayed it somewhat in the past to fight
>  spammers, and ETH (the hosting institution) and others have
>  increased spam and virus filtering so everything has become
>  quite a bit slower ]
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


