[R] problem with do.call or how to speed code avoiding for() loops [SUMMARY]

Fri Apr 12 18:56:29 CEST 2002

Hi

This is the summary of the discussion about do.call posted on Wed,
2002-04-10 at 13:00, by Ernesto Jardim.

The initial problem concerned the use of the do.call function. The
purpose was to avoid for() loops and speed up the code.

Regarding do.call, Peter Dalgaard pointed out that it is for
"situations where the argument list of a single call needs to be
constructed from simpler components".

Peter also said that, to loop over parallel vectors, something like
lapply should be used, and presented an example napply function.

Thomas Lumley pointed out that for() loops are only worth avoiding if
one is using vectorised functions, and explained what that means (see
message below).

Ernesto Jardim asked whether, if the apply family of functions is
written entirely in R, these functions are useful only for simplicity
in writing code.

Several messages noted that apply is only R code. Douglas Bates said
that "S Programming" discusses the need to profile the code before
implementing changes, if one wants to make it faster. Thomas mentioned
parallel processing and the speed increase it could bring to apply() if
it is ever implemented.

Prof. Ripley answered this, saying that apply() just streamlines a
for() loop, but lapply() is an internal function that is sometimes a lot
faster, and its use is encouraged. He also stated that apply() is a
matter of style.

All relevant messages are pasted below. If something is wrong in this
summary please let me know and I'll correct it.

Regards

EJ   

---------------------------------------
Starting message: On Wed, 2002-04-10 at 13:00, Ernesto Jardim wrote:

Hi

I'm writing a function that uses four parameters (scalars) and I need to
run it in an iterative process (the parameters vary to find the minimum
RSS). 

I don't want to use loops, so I tried the do.call function. However, it
didn't work. My understanding is that do.call simply runs the function
with the arguments replaced (vectors instead of scalars), instead of
running the function for each set of scalars in the list, which is what
I need.

Can you please tell me if there is another way of doing it without
using the for loop?

Thanks

EJ

ps: An example follows (of course the example doesn't make much sense,
but it describes the problem).

> fun
function(a,b){

        vec <- rnorm(25)
        res <- a*vec^b
        res

}
> fun(2,3)
 [1]  7.006278e+00  3.515010e-01  7.989718e+00 -3.377766e-02 -1.879471e-02
 [6] -2.920680e-01  1.174834e+00 -1.088638e-03  6.448725e+00  2.591805e+00
[11] -4.313672e-04 -9.171867e-03 -6.793569e+00 -2.480562e+01 -1.514828e+01
[16] -1.259896e-01 -7.504192e-02  6.647855e-02  5.609645e-01  1.093114e-01
[21]  1.802123e+00  7.650033e-03 -3.534951e+00 -2.028473e-03 -2.837360e+01
> do.call("fun",list(a=c(1:6),b=rnorm(6)))
 [1]  1.4766338        NaN  3.0214852  3.8132530  0.2753699        NaN
 [7]        NaN        NaN  2.9998547        NaN        NaN  6.3050385
[13]  0.5970596  0.8722498  2.9931344  4.0664852        NaN        NaN
[19]  2.8121803        NaN  2.9989127        NaN        NaN        NaN
[25] 14.4631627
Warning messages: 
1: longer object length
        is not a multiple of shorter object length in: vec^b 
2: longer object length
        is not a multiple of shorter object length in: a * vec^b 
> 

---------------------------------------
Peter Dalgaard:

That's not what do.call does. It is for situations where the argument
list of a single call needs to be constructed from simpler components.
Your example is equivalent to fun(a=c(1:6), b=rnorm(6))

The loop over multiple parallel vectors is only doable via something
like lapply(1:6, function(i)fun(a[i],b[i]))

However, I recently played with this and got as far as this:

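napply <-
function(..., FUN) {
   FUN <- match.fun(FUN)
   x <- list(...)                      # the parallel argument vectors
   lens <- sapply(x, length)
   len <- max(lens)
   if (any(lens != len))               # recycle shorter vectors
      x <- lapply(x, rep, length.out = len)
   # tuples[[i]] holds the i-th element of every argument
   tuples <- lapply(seq_len(len), function(i) lapply(x, "[", i))
   # build and evaluate one call to FUN per tuple
   sapply(tuples, function(t) eval(as.call(c(FUN, t))))
}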

>  napply(a=c(1:6),b=rnorm(6), FUN=fun)
           [,1]     [,2]     [,3]       [,4]     [,5]        [,6]
 [1,] 1.0259135      NaN 3.003882        NaN      NaN   20.299212
 [2,]       NaN 1.977696 3.026111        NaN 3.951746   19.107481
 [3,] 1.1840499 2.024837      NaN   8.289768      NaN    7.479917
 [4,] 0.9756922 2.003576      NaN        NaN 4.236000         NaN
 [5,] 1.0010550 2.006045      NaN        NaN      NaN 1302.330425
 [6,]       NaN      NaN      NaN   2.472650      NaN         NaN
 [7,]       NaN 2.094956      NaN        NaN      NaN    3.685879
 [8,] 0.8646628      NaN 2.993435        NaN 3.369501         NaN
 [9,]       NaN 2.044915 3.006433   6.426090 6.123980   19.235790
[10,] 1.6051736      NaN 3.011986        NaN 3.638641         NaN
....
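
The difference between the single call that do.call builds and a loop
over parallel vectors can be sketched as follows. This is only an
illustration and not part of the original thread: fun2 is a hypothetical
deterministic variant of fun (a fixed vec instead of rnorm(25)) so the
results are reproducible, and mapply() is the element-wise iterator
available in current versions of base R.

```r
# Hypothetical deterministic variant of fun(), for reproducible results
fun2 <- function(a, b) {
    vec <- c(1, 2, 3)
    a * vec^b
}

a <- 1:3
b <- c(1, 2, 3)

# do.call builds ONE call from the list; same as fun2(a = a, b = b)
one_call <- do.call("fun2", list(a = a, b = b))
print(one_call)                      # 1 8 81

# One call per (a[i], b[i]) pair: a list of three vectors
per_pair <- lapply(seq_along(a), function(i) fun2(a[i], b[i]))
print(per_pair[[2]])                 # 2 8 18

# mapply() does the same pairing without an explicit index
per_pair2 <- mapply(fun2, a, b, SIMPLIFY = FALSE)
```

Here one_call is a single vector of length 3, while per_pair holds one
result vector for each pair of parameters.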


---------------------------------------
Thomas Lumley:

And why wouldn't you want to use the for() loop?  Unless your function
is vectorised you're not going to gain anything by getting rid of the
for() loop.

Definition of vectorised function by Thomas:

Many R functions can operate on a vector of parameter values, e.g.

log(10, c(2, exp(1), 10)) gives the log of 10 to base 2, e, and 10

If your function can do this, you can construct a set of vectors
containing all your parameter values (expand.grid() is useful for this)
and evaluate your function once.
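
A minimal sketch of that idea, assuming a model of the form y = a * x^b
as in the original example. The data x and y, the candidate values in
the grid, and the names pred, rss and best are all made up for
illustration; they are not from the thread.

```r
# Hypothetical data generated from a = 2, b = 1.5
x <- c(1, 2, 3, 4)
y <- 2 * x^1.5

# Every combination of candidate parameter values, one per row
grid <- expand.grid(a = seq(1, 3, by = 0.5), b = seq(1, 2, by = 0.25))

# pred[i, j] = grid$a[i] * x[j]^grid$b[i], computed without an R-level loop
pred <- grid$a * outer(grid$b, x, function(b, x) x^b)

# Residual sum of squares for every parameter combination at once
rss <- rowSums(sweep(pred, 2, y)^2)

best <- grid[which.min(rss), ]
print(best)
```

The whole grid is evaluated in a handful of calls that loop in compiled
code, which is the situation where dropping the for() loop can pay off.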

This can be faster than for() loops when much of the iteration is done
in compiled code. If the iteration has to be done in interpreted code
then you can't really speed up the for() loops.  You can hide the loops
with the apply() functions, which may make your code more readable, but
it won't typically speed it up.

---------------------------------------
Ernesto Jardim:

This was not my understanding. I thought that if you can use functions
like apply and similar instead of for loops your code will be faster,
basically relying on the code of these functions, which is (or should
be) optimized for speed.

If what you're saying is true then using functions like apply is a
matter of simplicity and not speeding up the code. 

Is this correct ?

---------------------------------------
Douglas Bates:

Yes.

If you examine the apply function you will see that the bulk of the
work is done in a loop

    if (length(d.call) < 2) {
        if (length(dn.call)) 
            dimnames(newX) <- c(dn.call, list(NULL))
        for (i in 1:d2) ans[[i]] <- FUN(newX[, i], ...)
    }
    else for (i in 1:d2) ans[[i]] <- FUN(array(newX[, i], d.call, 
        dn.call), ...)

In their book "S Programming" (Springer, 2000) Venables and Ripley
discuss general strategies for writing R functions and for making them
faster.  One general principle is to profile the code before
implementing changes.  The manual "Writing R Extensions" has a section
on "Profiling R code" which is highly recommended.
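
As a rough sketch of what profiling looks like in practice, the base
functions Rprof() and summaryRprof() can be used as below. The function
slow_fun and the workload size are hypothetical, chosen only so that the
sampler has something to record, and the example assumes R was built
with profiling support.

```r
# Hypothetical slow function: grows its result one element at a time
slow_fun <- function(n) {
    res <- numeric(0)
    for (i in seq_len(n)) res <- c(res, sqrt(i))
    res
}

prof_file <- tempfile()
Rprof(prof_file)              # start sampling the call stack
result <- slow_fun(50000)
Rprof(NULL)                   # stop profiling

# Time spent in each function, by self time
prof_summary <- summaryRprof(prof_file)
print(head(prof_summary$by.self))
```

The table shows where the time actually goes, which is the information
to have before rewriting any loop.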

---------------------------------------
Thomas Lumley:

Yes. As you can easily verify [and always should verify if you're doing
optimisation], the apply commands are rarely faster than their for()
loop equivalents. They can be slower.
The speed advantage of apply is partly mythical -- there's never been
that much advantage -- and partly historical, as in some versions of
S-PLUS 3.x apply was often faster for complicated reasons due to memory
management.

The real point of the apply() family is to suppress unnecessary loop
variables and make your code tidier.  If we ever get parallel processing
then apply() could really become faster, but that's not going to happen
any time soon.

---------------------------------------
Brian Ripley:

I think that is a little pessimistic. It is true for apply() in R, which
just streamlines a for() loop, and also does things you may not want.
However, lapply is an internal function (written by me) because it is
sometimes a lot faster, and in my experiments never slower.

lapply() was a lot faster in S-PLUS 3.4.  It was often slower than for()
in 5.0, hence a lot of consternation.   There *are* a lot of myths
about, but not all in one direction.  As others have said, `S
Programming' tries to give a balanced view across 3 different S
implementations, and profiling can be a great tool in optimizing code
(it can be misleading too, but rarely when it matters).

Summary: lapply is encouraged.  apply is a matter of style.  Test out
whatever you do to see if it is really worthwhile.
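
One simple way to test this is to time both versions with system.time()
and check that they give the same result. The input x and its size are
arbitrary choices for illustration, and the timings will vary with the
R version and the machine, so none are shown here.

```r
n <- 1e5
x <- as.list(seq_len(n))      # hypothetical input

# for() loop with a preallocated result list
t_for <- system.time({
    out_for <- vector("list", n)
    for (i in seq_len(n)) out_for[[i]] <- x[[i]]^2
})

# The same computation with lapply()
t_lapply <- system.time(out_lapply <- lapply(x, function(v) v^2))

# Both approaches must agree before the timings mean anything
stopifnot(identical(out_for, out_lapply))

print(c(for_loop = t_for[["elapsed"]], lapply = t_lapply[["elapsed"]]))
```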
