[Rd] Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?

Simon Urbanek simon.urbanek at r-project.org
Sun May 26 01:38:40 CEST 2013


On May 25, 2013, at 3:48 PM, Henrik Bengtsson wrote:

> Hi,
> 
> in my packages/functions/code I tend to remove large temporary
> variables as soon as possible, e.g. large intermediate vectors used in
> iterations.  I sometimes also have the habit of doing this to make it
> explicit in the source code when a temporary object is no longer
> needed.  However, I did notice that this can add a noticeable overhead
> when the rest of the iteration step does not take that much time.
> 
> Trying to speed this up, I first noticed that rm(list="a") is much
> faster than rm(a).  While at it, I realized that, for the purpose of
> keeping the memory footprint small, I can equally well reassign the
> variable to a small object (e.g. a <- NULL), which is
> significantly faster than using rm().
> 

Yes, as you probably noticed, rm() is quite a complex function because it has to deal with different ways of specifying its input, etc.
When you remove that overhead (by calling .Internal(remove("a", parent.frame(), FALSE)) directly), you get the same performance as the assignment.
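For instance, a minimal wrapper (just a sketch - note that .Internal() is not part of the API and can change between R versions; the three-argument form below is what base::rm() itself uses):

rm_fast <- function(name, envir = parent.frame())
  .Internal(remove(name, envir, FALSE))

a <- 1:10
rm_fast("a")   # same effect as rm(list = "a"), minus the argument handling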
If you really want to go overboard, you can define your own function:

#include <Rinternals.h>
SEXP rm_C(SEXP x, SEXP rho) { setVar(x, R_UnboundValue, rho); return R_NilValue; }

poof <- function(x) .Call("rm_C", substitute(x), parent.frame())

That will be faster than anything else (mainly because it avoids the trip through strings as it can use the symbol directly).
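To try it, something like this should work (assuming the C line above is saved in a file, say rm.c):

# in a shell: R CMD SHLIB rm.c   (produces rm.so, or rm.dll on Windows)
dyn.load("rm.so")
a <- rnorm(1e6)
poof(a)
exists("a")   # FALSE - 'a' is now unbound in the calling frame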

But as Bill noted, in practice I'd recommend using either local() or functions to control scope; using rm() or assignments seems too error-prone to me.
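For example, something along these lines (a sketch with made-up names, reusing the matrix from the benchmark below):

colRowSums <- function(x, k) {
  a <- x[, k]    # temporaries live only for the duration of the call
  colSum <- sum(a)
  b <- x[k, ]
  rowSum <- sum(b)
  c(colSum = colSum, rowSum = rowSum)
}

# or inline, with local():
colSum <- local({
  a <- x[, 1]
  sum(a)   # 'a' becomes unreachable as soon as local() returns
})

Either way the temporaries go out of scope automatically, so there is nothing to forget to rm().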

Cheers,
Simon



> SOME BENCHMARKS:
> A toy example imitating an iterative algorithm with "large" temporary objects.
> 
> x <- matrix(rnorm(100e6), ncol=10e3)
> 
> t1 <- system.time(for (k in 1:ncol(x)) {
>  a <- x[,k]
>  colSum <- sum(a)
>  rm(a) # Not needed anymore
>  b <- x[k,]
>  rowSum <- sum(b)
>  rm(b) # Not needed anymore
> })
> 
> t2 <- system.time(for (k in 1:ncol(x)) {
>  a <- x[,k]
>  colSum <- sum(a)
>  rm(list="a") # Not needed anymore
>  b <- x[k,]
>  rowSum <- sum(b)
>  rm(list="b") # Not needed anymore
> })
> 
> t3 <- system.time(for (k in 1:ncol(x)) {
>  a <- x[,k]
>  colSum <- sum(a)
>  a <- NULL # Not needed anymore
>  b <- x[k,]
>  rowSum <- sum(b)
>  b <- NULL # Not needed anymore
> })
> 
>> t1
>   user  system elapsed
>   8.03    0.00    8.08
>> t1/t2
>    user   system  elapsed
> 1.322900 0.000000 1.320261
>> t1/t3
>    user   system  elapsed
> 1.715812 0.000000 1.662551
> 
> 
> Is there a reason why I shouldn't assign NULL instead of using rm()?
> As far as I understand it, the garbage collector will be equally
> efficient at cleaning out the previous object whether I use rm(a) or
> a <- NULL.  Is there anything else I'm overlooking?  Am I adding
> overhead somewhere else?
> 
> /Henrik
> 
> 
> PS. With the above toy example one can obviously be a bit smarter by using:
> 
> t4 <- system.time({for (k in 1:ncol(x)) {
>  a <- x[,k]
>  colSum <- sum(a)
>  a <- x[k,]
>  rowSum <- sum(a)
> }
> rm(list="a")
> })
> 
> but that's not my point.
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 


