[R] performance of do.call("rbind")

Marc Schwartz marc_schwartz at me.com
Mon Jun 27 19:05:40 CEST 2016


Hi,

Just to add my tuppence, which might not even be worth that these days...

I found the following blog post from 2013, which is likely dated to some extent, but which provides benchmarks for a few methods:

  http://rcrastinate.blogspot.com/2013/05/the-rbinding-race-for-vs-docall-vs.html

There is also a comment there pointing to the data.table package, which I don't use myself, but which may be worth evaluating.
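For what it's worth, a minimal sketch of that approach (the data here is made-up example data; this assumes data.table is installed) might be:

```r
## Hypothetical sketch: data.table::rbindlist() computes the size of
## the result up front instead of growing it piecemeal.
library(data.table)

## Made-up example data: 100 small data frames with identical columns.
data.list <- replicate(100,
                       data.frame(a = 1:100, b = rnorm(100)),
                       simplify = FALSE)

data <- as.data.frame(rbindlist(data.list))
```

rbindlist() returns a data.table; wrapping it in as.data.frame() gives back a plain data frame if that is what the rest of the code expects.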

As Bert and Sarah hinted at, there is overhead in taking the repetitive piecemeal approach.

If all of your data frames have exactly the same column structure (column order and column types), it may be prudent to pre-allocate a data frame of the target total row size yourself and then "insert" each "sub" data frame into it by row indexing.
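A rough sketch of that pre-allocation idea (variable names are made up; this assumes every element of data.list has identical columns):

```r
## Made-up example data: identical column structure in every frame.
data.list <- replicate(100,
                       data.frame(a = 1:100, b = rnorm(100)),
                       simplify = FALSE)

n.each <- vapply(data.list, nrow, integer(1))
total  <- sum(n.each)

## Pre-allocate the full-size target once, using the first frame as a
## template for column names and types.
out <- data.list[[1]][rep(1L, total), , drop = FALSE]
rownames(out) <- NULL

## "Insert" each sub data frame by row indexing into the target.
offset <- 0L
for (d in data.list) {
  out[(offset + 1L):(offset + nrow(d)), ] <- d
  offset <- offset + nrow(d)
}
```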

Regards,

Marc Schwartz


> On Jun 27, 2016, at 11:54 AM, Witold E Wolski <wewolski at gmail.com> wrote:
> 
> Hi Bert,
> 
> You are most likely right. I just thought that do.call("rbind", ...)
> was somehow more clever and allocated the memory up front. My error.
> After more searching I did find rbind.fill from plyr, which seems to
> do the job (it computes the size of the result data.frame and
> allocates it first).
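> A minimal sketch of that, with made-up column names (this assumes
> plyr is installed; rbind.fill also NA-fills columns that are missing
> from some frames):
> 
> ```r
> ## plyr::rbind.fill() computes the final size first, then fills it.
> library(plyr)
> 
> data.list <- list(data.frame(a = 1:3, b = 4:6),
>                   data.frame(a = 7:9))   # column b absent here
> 
> data <- rbind.fill(data.list)
> ```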
> 
> best
> 
> On 27 June 2016 at 18:49, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>> The following might be nonsense, as I have no understanding of R
>> internals; but ....
>> 
>> "Growing" structures in R by iteratively adding new pieces is often
>> warned against as inefficient when the number of iterations is
>> large, and your rbind() invocation might fall under this rubric. If
>> so, you might try issuing the call, say, 20 times over 10k disjoint
>> subsets of the list, and then rbinding the 20 large frames together.
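>> As a rough sketch of that idea (with a much smaller made-up list;
>> the chunk size would be ~10k in the real case):
>> 
>> ```r
>> ## Made-up example: 200 small frames, combined in 20 chunks of 10.
>> data.list <- replicate(200,
>>                        data.frame(a = 1:10, b = rnorm(10)),
>>                        simplify = FALSE)
>> 
>> chunks <- split(data.list, ceiling(seq_along(data.list) / 10))
>> big.pieces <- lapply(chunks, function(x) do.call(rbind, x))
>> result <- do.call(rbind, big.pieces)
>> ```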
>> 
>> Again, caveat emptor.
>> 
>> Cheers,
>> Bert
>> 
>> 
>> Bert Gunter
>> 
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>> 
>> 
>> On Mon, Jun 27, 2016 at 8:51 AM, Witold E Wolski <wewolski at gmail.com> wrote:
>>> I have a list (variable name data.list) with approx 200k data.frames
>>> with dim(data.frame) approx 100x3.
>>> 
>>> a call
>>> 
>>> data <- do.call("rbind", data.list)
>>> 
>>> does not complete - run time is prohibitive (I killed the R session
>>> after 5 minutes).
>>> 
>>> I would think that merging data.frames is a common operation. Is
>>> there a faster (more performant) function that I could use?
>>> 
>>> Thank you.
>>> Witold
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Witold Eryk Wolski
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
> 
> 
> 
> -- 
> Witold Eryk Wolski
> 


