[R] R badly lags matlab on performance?

Peter Dalgaard p.dalgaard at biostat.ku.dk
Sun Jan 4 23:31:03 CET 2009


Stavros Macrakis wrote:
> On Sat, Jan 3, 2009 at 7:02 PM,  <luke at stat.uiowa.edu> wrote:
>> R's interpreter is fairly slow due in large part to the allocation of
>> argument lists and the cost of lookups of variables, including ones
>> like [<- that are assembled and looked up as strings on every call.
> 
> Wow, I had no idea the interpreter was so awful. Just some simple
> tree-to-tree transformations would speed things up, I'd think, e.g.
> `<-`(`[`(...), ...) ==> `<-[`(...,...).

Doesn't really help (and it's not quite correct: a[2] <- 1 is equivalent to

a <- `[<-`(a, 2, 1)

with some sneakiness that relies on the two a's being the same object, so
that the second instance may be modified destructively.)
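
You can check the equivalence directly at the prompt (a throwaway
example in base R):

a <- c(10, 20, 30)
a <- `[<-`(a, 2, 1)   # same effect as a[2] <- 1
a                     # [1] 10  1 30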

The actual interpreter is not much of a bottleneck. There are two other 
major obstacles:

1) Things may not be what they seem

2) Insufficient control over object duplication


1) is the major impediment to compilability (look for talks and papers 
by Luke for further details and for ideas about what to do about it). 
The basic issue is that at no point can you be sure that the "log" 
function calculates logarithms. It might have been redefined as a side 
effect of the previous expression. This is a feature of the language as 
such, and it is difficult to change without destroying features that 
people actually use. The upshot is that every time we see an object 
name, we must search along the current search path to find its current 
binding.
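
A toy session makes the point (any function would do; log is just 
convenient):

log(100)                   # 4.60517, the base logarithm function
log <- function(x) "haha"  # a global binding now shadows base::log
log(100)                   # "haha" -- the old meaning is gone
rm(log)                    # unbind it and base::log is found again
log(100)                   # 4.60517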

2) is a little contentious: it is not certain how much we would gain by 
attacking it, only that it would be a heck of a lot of work. The issue 
is that we do not use true reference counting, as e.g. Tcl does. We use 
a primitive counter called NAMED, which can be 0, 1, or 2, and which 
only counts upwards. When it reaches 2, destructive modification is 
disallowed and the object must be copied instead. For instance, consider

x <- rnorm(1e6)
y <- x

at this point we actually have x and y referring to the same ~8MB block 
of memory. However, the semantics of R say that this is a virtual copy, 
so y[1] <- 1 or x[1] <- 1 entails duplicating the object. Fair enough: 
if an object is bound to multiple names, we cannot modify it in place. 
The problem is that we lose track when the references go away, and thus

y <- x
y[1] <- 1
x[1] <- 1

causes TWO duplications: the first assignment copies the shared block so 
that y gets its own private version, but NAMED on x never counts back 
down, so the second assignment copies again even though x is by then the 
only remaining reference. The really nasty bit is that we very often get 
objects temporarily bound to two names (think about what happens with 
arguments in function calls).
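
You can watch the copies happen with tracemem(), on builds with memory 
profiling enabled (the CRAN binaries are):

x <- rnorm(1e6)
tracemem(x)    # ask R to report whenever this object gets duplicated
y <- x         # no copy yet: x and y share the block, NAMED goes to 2
y[1] <- 1      # tracemem reports a copy; y gets its own block
x[1] <- 1      # a second copy, although x is now the sole reference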

Unfortunately, we cannot base the memory management purely on reference 
counting. And of course, doing so even partially implies that we need a 
much more concrete approach to the unbinding of objects. Notice, for 
instance, that the names used in a function evaluation frame are not 
guaranteed to be unbindable when the function exits: something might 
have saved the evaluation environment, e.g. using e <<- environment(), 
but there are also subtler ways.
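
A sketch of such an escape (the names are arbitrary):

e <- NULL
f <- function() {
    big <- rnorm(1e6)
    e <<- environment()   # the frame escapes into the global workspace
    invisible(NULL)
}
f()
head(e$big)               # 'big' is still reachable after f has exited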


-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907



