[Rd] modifying large R objects in place

Petr Savicky savicky at cs.cas.cz
Fri Sep 28 13:45:16 CEST 2007


On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
[...]
> >nrow <- function(...) dim(...)[1]
> >ncol <- function(...) dim(...)[2]
> >
> >At least in my environment, the new versions preserved NAMED == 1.
> >  
> Yes, but changing the formal arguments is a bit messy, is it not?

Specifically for nrow and ncol, I think not much, since almost nobody
needs to know (or even knows) that the name of the formal argument is "x".

However, there is another argument against the ... solution: it solves
the problem only in the simplest cases like nrow and ncol, but is not
usable in others, like colSums and rowSums. These functions also increase
NAMED on their argument, although their output does not contain any
reference to the original contents of that argument.
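The extra copy this causes can be observed directly (a minimal sketch;
tracemem() is only available when R was built with memory profiling,
and the exact NAMED behaviour of colSums is version-dependent):

```r
## Sketch: colSums only reads x, but (in the R of this discussion)
## it raises NAMED on x, so a later modification duplicates the matrix.
x <- matrix(0, 1000, 1000)
tracemem(x)       # report any duplication of x
s <- colSums(x)   # read-only use, yet NAMED is bumped
x[1, 1] <- 1      # this assignment now copies the whole matrix
```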

I think that a systematic solution to this problem could be helpful.
However, making these functions Internal or Primitive would
not be good in my opinion. It is advantageous that these functions
contain an R-level part, which makes the basic decisions before the
call to .Internal. If nothing else, this serves as a sort of
documentation.

For my purposes, I replaced calls to "colSums" and "matrix" with the
corresponding calls to .Internal in my script. The result is that
I can now complete several runs of my calculation in a loop instead
of restarting R after each run.
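For colSums, the workaround looks roughly like the following (a sketch
only: the .Internal argument list is taken from the base R definition
of that era and may differ between R versions, so this is fragile by
construction):

```r
## Hypothetical wrapper that skips the R-level part of colSums for a
## plain numeric matrix; argument order (x, nrow, ncol, na.rm) follows
## the base definition current at the time of writing.
fast.colSums <- function(x)
    .Internal(colSums(x, nrow(x), ncol(x), FALSE))
```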

This leads me to a question. Some of the tests I did suggest
that gc() may not free all the memory, even if I remove all data
objects with rm() before calling gc(). Is this possible, or must I
have missed something?

A possible solution to the unwanted increase of NAMED due to temporary
calculations could be to give the user the ability to store the NAMED
value of an object before a call to a function and restore it after
the call. To use this, the user would have to be confident that no new
reference to the object persists after the function completes.

> Presumably, nrow <- function(x) eval.parent(substitute(dim(x)[1])) works 
> too, but if the gain is important enough to warrant that sort of 
> programming, you might as well make nrow a .Primitive.

You are right. This indeed works.
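For completeness, the substitute() variant can be checked like this
(nrow2 is a hypothetical name for illustration; the expression is
evaluated in the caller's frame, so dim() sees the caller's own object
rather than a new binding):

```r
## eval.parent(substitute(...)) avoids creating a local reference to x
nrow2 <- function(x) eval.parent(substitute(dim(x)[1]))
m <- matrix(0, 2, 3)
nrow2(m)  # 2
```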

> Longer-term, I still have some hope for better reference counting, but 
> the semantics of environments make it really ugly -- an environment can 
> contain an object that contains the environment, a simple example being 
> 
> f <- function()
>    g <- function() 0
> f()
> 
> At the end of f(), we should decide whether to destroy f's evaluation 
> environment. In the present example, what we need to be able to see is 
> that this would remove all refences to g and that the reference from g 
> to f can therefore be ignored.  Complete logic for sorting this out is 
> basically equivalent to a new garbage collector, and one can suspect 
> that applying the logic upon every function return is going to be 
> terribly inefficient. However, partial heuristics might apply.

I have to say that I do not quite understand the example.
What are the input and output of f? Is g only defined inside,
or also used?

Let me ask the following question. I assume that gc() scans the whole
memory and determines, for each piece of data, whether a reference
to it still exists. In my understanding, this is equivalent to
determining whether its NAMED value may be dropped to zero.
Structures for which this succeeds are then removed. Am I right?
If yes, would it also be possible during gc() to determine the cases
where NAMED may be dropped from 2 to 1? How much would this increase
the complexity of gc()?

Thank you in advance for your kind reply.

Petr Savicky.


