[Rd] modifying large R objects in place

Petr Savicky savicky at cs.cas.cz
Sat Sep 29 10:28:34 CEST 2007


On Fri, Sep 28, 2007 at 08:14:45AM -0500, Luke Tierney wrote:
[...]
> [...] A related issue is that user-defined
> assignment functions always see a NAMED of 2 and hence cannot modify
> in place. We've been trying to come up with a reasonable solution to
> this, so far without success but I'm moderately hopeful.

If a user-defined function evaluates its body in its parent environment,
following Peter Dalgaard's suggestion of eval.parent(substitute( .... )),
then the NAMED attribute is not increased and the function may do
in-place modifications.
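
For illustration, a minimal sketch of such a function (the name set1
is hypothetical; whether the modification really avoids a copy still
depends on the NAMED of the argument at the call site):

  set1 <- function(x, i, value)
      eval.parent(substitute(x[i] <- value))
  A <- 1:10
  set1(A, 3, 0L)  # evaluates A[3] <- 0L in the caller's frame; the
  print(A)        # arguments are only used via substitute(), so the
                  # promise for x is never forced and NAMED(A) is
                  # not raised by the call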

On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
> Longer-term, I still have some hope for better reference counting, but 
> the semantics of environments make it really ugly -- an environment can 
> contain an object that contains the environment, a simple example being 
> 
> f <- function()
>    g <- function() 0
> f()
> 

On Fri, Sep 28, 2007 at 09:46:39AM -0400, Duncan Murdoch wrote:
> f has no input; its output is the function g, whose environment is the 
> evaluation environment of f.  g is never used, but it is returned as the 
> value of f.  Thus we have the loop:
> 
> g refers to the environment.
> the environment contains g.
> 
> Even though the result of f() was never saved, two things (the 
> environment and g) got created and each would have non-zero reference 
> count.

Thank you very much for the example and explanation. I would
not have guessed that something like this is possible, but now I see
that it may, in fact, be quite common. For example,
  something <- function()
  {
      a <- 1:5
      b <- 6:10
      c <- c("a","a","b","b","b")
      mf <- model.frame(c ~ a + b)
      mf
  }
  mf1 <- something()
  e1 <- attr(attr(mf1,"terms"),".Environment")
  mf2 <- eval(expression(mf),envir=e1)
  e2 <- attr(attr(mf2,"terms"),".Environment")
  print(identical(e1,e2)) # TRUE
seems to be a similar situation. Here, the references go in the
sequence mf1 -> e1 -> mf2 -> e1. I think that mf2 is in fact the
same object as mf1, but I do not know how to demonstrate this.
However, both mf1 and mf2 refer to the same environment, so
e1 -> mf2 -> e1 is a cycle for sure.

On Fri, Sep 28, 2007 at 08:14:45AM -0500, Luke Tierney wrote:
> >If yes, is it possible during gc() to determine also cases,
> >when NAMED may be dropped from 2 to 1? How much would this increase
> >the complexity of gc()?
> 
> Probably not impossible but would be a fair bit of work with probably
> not much gain as the NAMED values would still be high until the next
> gc of the appropriate level, which will probably be a fair time as an
> object being modified is likely to be older, but the interval in which
> there would be a benefit is short.

On Fri, Sep 28, 2007 at 04:36:40PM +0100, Prof Brian Ripley wrote:
[...]
> On Fri, 28 Sep 2007, Luke Tierney wrote:
[...]
> >approach may be possible. A related issue is that user-defined
> >assignment functions always see a NAMED of 2 and hence cannot modify
> >in place. We've been trying to come up with a reasonable solution to
> >this, so far without success but I'm moderately hopeful.
> 
> I am not persuaded that the difference between NAMED=1/2 makes much 
> difference in general use of R, and I recall Ross saying that he no longer 
> believed that this was a worthwhile optimization.  It's not just 
> 'user-defined' replacement functions, but also all the system-defined 
> closures (including all methods for the generic replacement functions 
> which are primitive) that are unable to benefit from it.

I am thinking about the following situation. The user creates a large
matrix A and then performs a sequence of operations on it. Some of
the operations scan the matrix in a read-only manner (calculating,
e.g., some summaries); some are top-level commands that modify the
matrix itself. I do not argue that such a sequence of operations
should be done in place by default. However, I think that R should
provide tools that allow doing this in place, if the user does some
extra work. If the matrix is really large, then in-place operations
are not only more space efficient, but also more time efficient.
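
A minimal sketch of the scenario (whether the assignment really
duplicates depends on NAMED(A) at that point, which in turn depends
on how A was created and scanned):

  A <- matrix(0, 1000, 1000)  # about 8 MB of doubles
  s <- sum(A)                 # a read-only scan
  A[1, 1] <- 1                # if NAMED(A) == 2 here, all of A is
                              # duplicated to change one element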

Using the information from the current thread, there are two
possible approaches to achieve this.

1. The initial matrix should not be generated by the "matrix" function,
   due to the observation by Henrik Bengtsson (this is the issue
   with dimnames). The matrix may be created using, e.g.,
     .Internal(matrix(data, nrow, ncol, byrow))

   The matrix should not be scanned using an R function that evaluates
   its body in its own environment. This includes the functions nrow,
   ncol, colSums, rowSums and probably more. The matrix may be scanned
   by functions that use eval.parent(substitute( .... )) and avoid
   giving the matrix a new name. The user may prepare versions of nrow,
   ncol, colSums, rowSums, etc. with this property (see the sketch
   after point 2 below).

2. If the NAMED attribute of A can be decreased from 2 to 1 during an
   operation similar to garbage collection (if A is not in a reference
   cycle), then the above approach may also be combined with operations
   that themselves work in place or read-only, but increase NAMED(A)
   as a side effect. In this case, the user should explicitly invoke
   the "NAMED reduction" after such operations. If the user has only a
   small number of large objects, then gc() is faster than the
   duplication of some of the large objects. So, I expect that the
   "NAMED reduction" could also be more time efficient than some of
   the unwanted duplications.
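
To make point 1 concrete, hypothetical NAMED-preserving variants of
nrow and ncol might look as follows (a sketch; the inner call must be
a primitive such as dim, since calling an ordinary closure would
raise NAMED again through its argument matching):

  nrow2 <- function(x) eval.parent(substitute(dim(x)[1]))
  ncol2 <- function(x) eval.parent(substitute(dim(x)[2]))
  A <- 1:10
  dim(A) <- c(2, 5)  # set dim directly, avoiding the matrix() issue
  nrow2(A)           # 2, without raising NAMED(A)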

During the previous discussion, exact counting of references was
sometimes mentioned. So, I want to state explicitly that I do not
think it is a good idea. In my opinion, it is definitely not
reasonable now. I am very satisfied with the stability of R sessions,
and this would be in danger during a transition to full counting.
Moreover, I can imagine (I am not an expert on this) that the
efficiency and simplicity benefits of the guaranteed approximate
counting outweigh the disadvantage (a bit more duplication than
necessary) in a typical R session.

However, there are situations where the cost of duplication is
too high and the user knows about it in advance. In such situations,
having more tools for the explicit control of duplication could help.
The tools may be, for example, a function that allows a simple query
of the NAMED status of a given object at the R level, and
modifications of some of the built-in functions to be more careful
with the NAMED attribute. A possible strengthening of gc() would, of
course, be very useful here. I am thinking of explicit use of it, not
of automatic runs. So, for safety reasons, the "NAMED reduction"
could be done by a different function, not the default gc() itself.
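
Purely to make this proposal concrete, the interface could look like
the following (none of these functions exist in R; the names named
and named.reduce are invented here):

  # named(x)       - query the NAMED value of x at the R level
  # named.reduce() - the restricted gc pass lowering NAMED from 2
  #                  to 1 for objects not kept in a reference cycle
  # Intended usage after a read-only scan that raised NAMED(A):
  #   named.reduce()
  #   if (named(A) == 1)
  #       A[1, 1] <- 0  # the modification now happens in place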

Petr Savicky.


