[Rd] Objectsize function visiting every element for alt-rep strings

Travers Ching tr@ver@c @end|ng |rom gm@||@com
Wed Jan 23 19:27:22 CET 2019


It should be possible to calculate object.size in the presence of
sharing, at least with respect to all sub-nodes of a SEXP.  E.g.,
during calculation, keep a hash of all SEXP pointers visited.  If a
pointer has already been visited, add only the size of the pointer to
the total object size.
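
As a rough illustration of that idea, here is a toy model in plain R
rather than a patch to object.size. Everything in it is made up for the
example: the `heap` list, size_with_sharing() and size_naive() are
hypothetical names, node names stand in for SEXP addresses, and 8-byte
pointers are assumed.

```r
# Toy heap: each entry stands in for one SEXP. The entry name plays the
# role of the pointer; 'size' is the node's own byte count and 'children'
# lists the nodes it references. All values are invented for illustration.
heap <- list(
  root   = list(size = 64,   children = c("a", "b")),
  a      = list(size = 48,   children = "shared"),
  b      = list(size = 48,   children = "shared"),
  shared = list(size = 8000, children = character(0))
)

# Count every node once; a repeated visit costs only one pointer.
size_with_sharing <- function(heap, start, pointer_size = 8) {
  visited <- new.env(hash = TRUE, parent = emptyenv())  # hash set of "pointers"
  walk <- function(id) {
    if (exists(id, envir = visited, inherits = FALSE)) {
      return(pointer_size)          # already counted: charge the reference only
    }
    assign(id, TRUE, envir = visited)
    node <- heap[[id]]
    total <- node$size
    for (child in node$children) total <- total + walk(child)
    total
  }
  walk(start)
}

# For comparison: no bookkeeping, so shared nodes are counted repeatedly.
size_naive <- function(heap, id) {
  node <- heap[[id]]
  total <- node$size
  for (child in node$children) total <- total + size_naive(heap, child)
  total
}

size_with_sharing(heap, "root")  # 8168: 'shared' counted once, plus one pointer
size_naive(heap, "root")         # 16160: 'shared' counted twice
```

This only handles sharing below a single root; it says nothing about
objects shared between two separate top-level objects.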

Travers

On Wed, Jan 23, 2019 at 1:33 AM Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>
> On 1/22/19 6:17 PM, Kevin Ushey wrote:
> > I think that object.size() is most commonly used to answer the question,
> > "what R objects are consuming the most memory currently in my R session?"
> > and for that reason I think returning the size of the internal
> > representations of objects (e.g. for ALTREP objects and unevaluated promises)
> > is the right default behavior.
>
> I don't think one could answer that question at all in the presence of
> sharing (of objects with value semantics due to copy on write, string
> cache or other caches, sharing of objects with referential semantics
> such as environments, etc). Also the mapping from R objects (SEXPs) to
> what users might understand as objects would not be clear (which SEXPs
> belong to which "object", which SEXPs are too low-level for the user to
> be considered, etc). In principle, there could be a memory profiler
> working at SEXP level and exposing all the intricacies of the memory
> layout, answering reachability questions on a heap dump (so one could
> find out about a 1G integer vector and then list all bindings say in
> namespace environments from which it is reachable), but of course that
> would be a lot of work to implement and to maintain. The problem is not
> unique to R (e.g. see Java with the same problems of sharing that
> prevent meaningful definition for object size). I am not persuaded it
> makes sense to add more options to a function that does not have and
> cannot have a well defined user-level semantics, and I would discourage
> writing code that is trying to build on that function as I think that it
> might lead to confusion and frustration. I think equality for example is
> easier to define (just that one could come up with multiple meaningful
> definitions, so it makes sense to have multiple options).
>
> Best
> Tomas
> >
> > I also agree it would be worth considering adding arguments that control
> > how object.size() is computed for different kinds of R objects, since users
> > might want to use object.size() to answer different types of questions.
> >
> > All that said, if the ultimate goal here is to avoid having RStudio
> > materialize ALTREP objects in the background, then perhaps that change
> > should happen in RStudio :-)
> >
> > Best,
> > Kevin
> >
> > On Tue, Jan 22, 2019 at 8:21 AM Tierney, Luke <luke-tierney using uiowa.edu>
> > wrote:
> >
> >> On Mon, 21 Jan 2019, Martin Maechler wrote:
> >>
> >>>>>>>> Travers Ching
> >>>>>>>>      on Tue, 15 Jan 2019 12:50:45 -0800 writes:
> >>>     > I have a toy alt-rep string package that generates
> >>>     > randomly seeded strings.  Example:
> >>>
> >>>     >   library(altstringisode)
> >>>     >   x <- altrandomStrings(1e8)
> >>>     >   head(x)
> >>>     >   [1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1"
> >>>     >   "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
> >>>     >   object.size(x)
> >>>
> >>>     > object.size will call the Elt method (the one registered
> >>>     > with R_set_altstring_Elt_method) for every single element,
> >>>     > materializing (slowly) every element of the vector.  This
> >>>     > is a problem mostly in RStudio since object.size is called
> >>>     > automatically, defeating the purpose of alt-rep.
> >> There is no sensible way in general to figure out how large the
> >> strings would be without computing them. There might be specifically
> >> for a deferred sequence conversion but it would require a fair bit of
> >> effort to figure out that would be better spent elsewhere.
> >>
> >> I've never been a big fan of object.size since what it is trying to
> >> compute isn't very well defined in the context of sharing and possible
> >> internal state changes (even before ALTREP, byte code compilation could
> >> change the internals of a function [which object.size sees] and
> >> assigning into environments or evaluating promises can change
> >> environments [which object.size ignores]). The issue is not unlike the
> >> one faced by identical(), which has a bunch of options for the
> >> different ways objects can be identical, and might need even more.
> >>
> >> We could in general have object.size for an ALTREP return the
> >> object.size results of the current internal representation, but that
> >> might not always be appropriate. Again, what object.size is trying to
> >> compute isn't very well defined.
> >>
> >> RStudio does seem to call object.size on every assignment to
> >> .GlobalEnv. That might be worth revisiting.
> >>
> >>
> >> Best,
> >>
> >> luke
> >>
> >>> Hmm.  But still, the idea had been that object.size()  *should*
> >>> return the size of the "de-ALTREP'ed" object *but* should not
> >>> de-ALTREP it.
> >>> That's what happens for integers, but indeed fails to happen for
> >>> such as.character(.)ed integers.
> >>>
> >>> From my eRum presentation (which took from the official ALTREP
> >>> documentation
> >>> https://svn.r-project.org/R/branches/ALTREP/ALTREP.html ) :
> >>>
> >>>   > x <- 1:1e15
> >>>   > object.size(x) # 8000'000'000'000'048 bytes : 8000 TBytes -- ok, not really
> >>>   8000000000000048 bytes
> >>>   > is.unsorted(x) # FALSE : i.e., R *knows* it is sorted
> >>>   [1] FALSE
> >>>   > xs <- sort(x)  #
> >>>   > .Internal(inspect(x))
> >>>   @80255f8 14 REALSXP g0c0 [NAM(7)]  1 : 1000000000000000 (compact)
> >>>   >
> >>>
> >>>   > cx <- as.character(x)
> >>>   > .Internal(inspect(cx))
> >>>   @80485d8 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
> >>>     @80255f8 14 REALSXP g1c0 [MARK,NAM(7)]  1 : 1000000000000000 (compact)
> >>>   > system.time( print(object.size(x)), gc=FALSE)
> >>>   8000000000000048 bytes
> >>>      user  system elapsed
> >>>     0.000   0.000   0.001
> >>>   > system.time( print(object.size(cx)), gc=FALSE)
> >>>   Error: cannot allocate vector of size 8388608.0 Gb
> >>>   Timing stopped at: 11.43 0 11.46
> >>>   >
> >>>
> >>> One could consider it a bug that object.size(cx) is indeed
> >>> inspecting every string, i.e., accessing cx[i] for all i.
> >>> Note that it is *not*  deALTREPing cx  itself :
> >>>
> >>> > x <- 1:1e6
> >>> > cx <- as.character(x)
> >>> > .Internal(inspect(cx))
> >>> @7f5b1a0 16 STRSXP g0c0 [NAM(1)]   <deferred string conversion>
> >>>   @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
> >>> > system.time( print(object.size(cx)), gc=FALSE)
> >>> 64000048 bytes
> >>>    user  system elapsed
> >>>   0.369   0.005   0.374
> >>> > .Internal(inspect(cx))
> >>> @7f5b1a0 16 STRSXP g0c0 [NAM(7)]   <deferred string conversion>
> >>>   @7f5adb0 13 INTSXP g0c0 [NAM(7)]  1 : 1000000 (compact)
> >>>
> >>>     > Is there a way to avoid the problem of forced
> >>>     > materialization in RStudio?
> >>>
> >>>     > PS: Is there a way to tell if a post has been received by
> >>>     > the mailing list?  How long does it take to show up in the
> >>>     > archives?
> >>>
> >>> [ that (waiting time) distribution is quite right skewed... I'd
> >>>   guess its median to be less than 10 minutes... but we had
> >>>   artificially delayed it somewhat in the past to fight
> >>>   spammers, and ETH (the hosting institution) and others have
> >>>   increased spam and virus filtering so everything has become
> >>>   quite a bit slower ]
> >>>
> >>> ______________________________________________
> >>> R-devel using r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >> --
> >> Luke Tierney
> >> Ralph E. Wareham Professor of Mathematical Sciences
> >> University of Iowa                  Phone:             319-335-3386
> >> Department of Statistics and        Fax:               319-335-3017
> >>      Actuarial Science
> >> 241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
> >> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu