[R] R badly lags matlab on performance?
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Sun Jan 4 23:31:03 CET 2009
Stavros Macrakis wrote:
> On Sat, Jan 3, 2009 at 7:02 PM, <luke at stat.uiowa.edu> wrote:
>> R's interpreter is fairly slow due in large part to the allocation of
>> argument lists and the cost of lookups of variables, including ones
>> like [<- that are assembled and looked up as strings on every call.
>
> Wow, I had no idea the interpreter was so awful. Just some simple
> tree-to-tree transformations would speed things up, I'd think, e.g.
> `<-`(`[`(...), ...) ==> `<-[`(...,...).
Doesn't really help (and it's not quite correct: a[2] <- 1 is equivalent to
a <- `[<-`(a, 2, 1)
with some sneakiness that assumes the two a's are the same, so that you
might destructively modify the second instance.)
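To see the equivalence at the prompt, here is a small sketch (the real
evaluator is sneakier and avoids the intermediate copy when it safely can):
a <- c(10, 20, 30)
a[2] <- 1              # the familiar form
b <- c(10, 20, 30)
b <- `[<-`(b, 2, 1)    # what the assignment effectively expands to
identical(a, b)        # TRUE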
The actual interpreter is not much of a bottleneck. There are two other
major obstacles:
1) Things may not be what they seem
2) Insufficient control over object duplication
1) is the major impediment to compilability (look for talks/papers by
Luke for further details and ideas about what to do about it). The basic
issue is that at no point can you be sure that the "log" function
calculates logarithms. It might be redefined as a side effect of the
previous expression. This is a feature of the language as such, and it
is difficult to change without destroying features that people actually
use. The upshot is that every time we see an object name, we enter a
search along the current search path to find its current binding.
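A toy illustration (assuming it is run at top level, so the new binding
shadows base::log on the search path):
log <- function(x) "not a logarithm"   # shadows base::log
log(10)                                # now returns "not a logarithm"
rm(log)                                # drop the shadow
log(10)                                # base::log is found again: 2.302585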
2) is a little contentious: It is not certain how much we gain by
attacking it, only that it would be a heck of a lot of work. The issue
is that we do not use reference counting like e.g. Java or Tcl does. We
use a primitive counter called NAMED, which can be 0, 1, or 2, and only
counts upwards. When it reaches 2, destructive modification is
disallowed and the object must be copied. I.e. consider
x <- rnorm(1e6)
y <- x
at this point we actually have x and y referring to the same ~8MB block
of memory. However, the semantics of R is that this is a virtual copy,
so y[1] <- 1 or x[1] <- 1 entails that we duplicate the object. Fair
enough: if an object is bound to multiple names, we cannot modify it in
place. The problem is that we lose track when the references go away,
and thus,
y <- x
y[1] <- 1
x[1] <- 1
causes TWO duplications. The really nasty bit is that we very often get
objects temporarily bound to two names (think about what happens with
arguments in function calls).
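You can watch the two copies with tracemem() -- a sketch that assumes an R
build with memory profiling enabled (an R with real reference counting
would not need the second copy):
x <- rnorm(1e6)
tracemem(x)   # report whenever this vector is duplicated
y <- x        # no copy yet, just a second binding; NAMED goes to 2
y[1] <- 1     # first duplication: y gets its own block
x[1] <- 1     # second duplication: NAMED on x never counted back down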
Unfortunately, we cannot base memory management purely on reference
counting. And of course, doing so, even partially, implies that we need a
much more concrete approach to the unbinding of objects.
Notice, for instance, that the names used in a function evaluation frame
are not guaranteed to be unbindable when the function exits. Something
might have saved the evaluation environment, e.g. using e <<-
environment(), but there are also more subtle methods.
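A minimal example of a frame outliving its call (the name e is mine; the
superassignment creates it in the global workspace):
f <- function() {
    z <- rnorm(1e6)
    e <<- environment()  # the evaluation frame escapes through e
    sum(z)
}
f()
length(get("z", envir = e))  # 1e6: z is still reachable after f() returned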
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907