[R] Please explain "do.call" in this context, or critique to "stack this list faster"

Sat Sep 4 20:49:16 CEST 2010

On 09/04/2010 01:37 PM, Paul Johnson wrote:
> I've been doing some consulting with students who seem to come to R
> from SAS.  They are usually pre-occupied with do loops and it is tough
> to persuade them to trust R lists rather than keeping 100s of named
> matrices floating around.
>
> Often it happens that there is a list with lots of matrices or data
> frames in it and we need to "stack those together".  I thought it
> would be a simple thing, but it turns out there are several ways to
> get it done, and in this case, the most "elegant" way using do.call is
> not the fastest, but it does appear to be the least prone to
> programmer error.
>
> I have been staring at ?do.call for quite a while and I have to admit
> that I just need some more explanations in order to interpret it.  I
> can't really get why this does work
>
> do.call( "rbind", mylist)

do.call *constructs* a function call from the list of arguments,
mylist.

It is shorthand for

    rbind(mylist[[1]], mylist[[2]], mylist[[3]])

assuming mylist has 3 elements.
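
As a small illustration (the three tiny data frames and the name
smallList are made up for this example, they are not from your script),
the two calls below should produce the same stacked data frame:

```r
## Three small made-up data frames, just for illustration
a <- data.frame(x = 1:2, y = c(10, 20))
b <- data.frame(x = 3:4, y = c(30, 40))
d <- data.frame(x = 5:6, y = c(50, 60))
smallList <- list(a, b, d)

## do.call builds the single call rbind(a, b, d) from the list elements
viaDoCall <- do.call("rbind", smallList)

## The same call written out by hand
byHand <- rbind(smallList[[1]], smallList[[2]], smallList[[3]])

## Compare the two results
all.equal(viaDoCall, byHand)
```

The do.call form has the advantage that it does not care how long the
list is, so the same line works for 4 data frames or 1000.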

>
> but it does not work to do
>
> sapply ( mylist, rbind).

That's because sapply calls rbind once for each item in mylist,
passing it a single data frame each time, so nothing ever gets
stacked.  do.call instead makes one call with all the items as
arguments.
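
A minimal sketch of the difference, again with made-up data (I use
lapply here rather than sapply only because its result is easier to
inspect; the per-item calling pattern is the same):

```r
## Two small made-up data frames
a <- data.frame(x = 1:2, y = c(10, 20))
b <- data.frame(x = 3:4, y = c(30, 40))
smallList <- list(a, b)

## lapply evaluates rbind(smallList[[1]]) and then rbind(smallList[[2]]);
## rbind given a single data frame has nothing to stack it with
perItem <- lapply(smallList, rbind)
length(perItem)       # one result per list element
nrow(perItem[[1]])    # still only the rows of the first data frame

## do.call makes one call that sees every element at once
stacked <- do.call("rbind", smallList)
nrow(stacked)         # rows of a and b together
```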

It might help to use a debugging technique to watch rbind being
called under each of those two approaches, and to see how many times
it gets called and with what arguments.
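
One way to do that, sketched here under my own assumptions, is to wrap
rbind in a small function that reports each call.  The names noisyRbind
and nCalls are hypothetical helpers invented for this example, and the
data frames are made up:

```r
## Three small made-up data frames
a <- data.frame(x = 1:2, y = c(10, 20))
b <- data.frame(x = 3:4, y = c(30, 40))
d <- data.frame(x = 5:6, y = c(50, 60))
smallList <- list(a, b, d)

## Hypothetical wrapper: counts the call, reports how many arguments
## it received, then passes everything on to rbind unchanged
nCalls <- 0
noisyRbind <- function(...) {
  nCalls <<- nCalls + 1
  cat("rbind called with", length(list(...)), "argument(s)\n")
  rbind(...)
}

## do.call: a single call that receives every list element
nCalls <- 0
res1 <- do.call(noisyRbind, smallList)
callsDoCall <- nCalls

## sapply: a separate call for each list element
nCalls <- 0
res2 <- sapply(smallList, noisyRbind)
callsSapply <- nCalls

c(do.call = callsDoCall, sapply = callsSapply)
```

The messages printed along the way show the argument count for each
call, which is the part that differs between the two approaches.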

>
> Anyway, here's the self contained working example that compares the
> speed of various approaches.  If you send yet more ways to do this, I
> will add them on and then post the result to my Working Example
> collection.
>
> ## stackMerge.R
> ## Paul Johnson<pauljohn at ku.edu>
> ## 2010-09-02
>
>
> ## rbind is neat, but how to do it to a lot of
> ## data frames?
>
> ## Here is a test case
>
> df1<- data.frame(x=rnorm(100),y=rnorm(100))
> df2<- data.frame(x=rnorm(100),y=rnorm(100))
> df3<- data.frame(x=rnorm(100),y=rnorm(100))
> df4<- data.frame(x=rnorm(100),y=rnorm(100))
>
> mylist<-  list(df1, df2, df3, df4)
>
> ## Usually we have done a stupid
> ## loop  to get this done
>
> resultDF<- mylist[[1]]
> for (i in 2:4) resultDF<- rbind(resultDF, mylist[[i]])
>
> ## My intuition was that this should work:
> ## lapply( mylist, rbind )
> ## but no! It just makes a new list
>
> ## This obliterates the columns
> ## unlist( mylist )
>
> ## I got this idea from code in the
> ## "complete" function in the "mice" package
> ## It uses brute force to allocate a big matrix of 0's and
> ## then it places the individual data frames into that matrix.
>
> m<- 4
> nr<- nrow(df1)
> nc<- ncol(df1)
> dataComplete<- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
> for (j in  1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ]<- mylist[[j]]
>
>
>
> ## I searched a long time for an answer that looked better.
> ## This website is helpful:
> ## http://stackoverflow.com/questions/tagged/r
> ## I started to type in the question and 3 plausible answers
> ## popped up before I could finish.
>
> ## The terse answer is:
> shortAnswer<- do.call("rbind",mylist)
>
> ## That's the right answer, see:
>
> shortAnswer == dataComplete
> ## But I don't understand why it works.
>
> ## More importantly, I don't know if it is fastest, or best.
> ## It is certainly less error prone than "dataComplete"
>
> ## First, make a bigger test case and use system.time to evaluate
>
> phony<- function(i){
>    data.frame(w=rnorm(1000), x=rnorm(1000),y=rnorm(1000),z=rnorm(1000))
> }
> mylist<- lapply(1:1000, phony)
>
>
> ### First, try the terse way
> system.time( shortAnswer<- do.call("rbind", mylist) )
>
>
> ### Second, try the complete way:
> m<- 1000
> nr<- nrow(mylist[[1]])
> nc<- ncol(mylist[[1]])
>
> system.time(
>     dataComplete<- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
>   )
>
> system.time(
>     for (j in  1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ]<- mylist[[j]]
> )
>
>
> ## On my Thinkpad T62 dual core, the "shortAnswer" approach takes about
> ## three times as long:
>
>
> ##>  system.time( bestAnswer<- do.call("rbind",mylist) )
> ##    user  system elapsed
> ##  14.270   1.170  15.433
>
> ##>  system.time(
> ## +    dataComplete<- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
> ## +  )
> ##    user  system elapsed
> ##   0.000   0.000   0.006
>
> ##>  system.time(
> ## + for (j in  1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ]<- mylist[[j]]
> ## + )
> ##    user  system elapsed
> ##   4.940   0.050   4.989
>
>
> ## That makes the do.call way look slow, and I said "hey,
> ## our stupid for loop at the beginning may not be so bad."
> ## Wrong. It is a disaster.  Check this out:
>
>
> ##>  resultDF<- phony(1)
> ##>  system.time(
> ## + for (i in 2:1000) resultDF<- rbind(resultDF, mylist[[i]])
> ## +    )
> ##    user  system elapsed
> ## 159.740   4.150 163.996
>
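
Since you asked for more ways to do this, here is one more to try.  It
is only a sketch under assumptions: stackByColumn is a hypothetical
helper I made up for this message, it assumes every data frame has the
same columns, and it assumes the columns are plain atomic vectors such
as numeric (factors would need extra care).  I have not timed it
against your examples, so please run it through system.time yourself
before drawing any conclusions.

```r
## Made-up test data, smaller than the example in the script
smallList <- lapply(1:4, function(i) data.frame(x = rnorm(100), y = rnorm(100)))

## Hypothetical helper: stack one column at a time with unlist, then
## assemble the stacked columns into a single data frame
stackByColumn <- function(dfList) {
  colNames <- names(dfList[[1]])
  cols <- lapply(colNames, function(nm) {
    unlist(lapply(dfList, function(d) d[[nm]]), use.names = FALSE)
  })
  names(cols) <- colNames
  as.data.frame(cols)
}

viaColumns <- stackByColumn(smallList)
viaDoCall <- do.call("rbind", smallList)

## Check that both approaches hold the same values
all.equal(viaColumns$x, viaDoCall$x)
all.equal(viaColumns$y, viaDoCall$y)
dim(viaColumns)
```

The idea is that unlist on plain vectors avoids the per-data-frame
bookkeeping that rbind has to do, at the cost of the assumptions above.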
>