[R] Performance tuning tips when working with wide datasets

Gabor Grothendieck ggrothendieck at gmail.com
Wed Nov 24 13:54:57 CET 2010


On Wed, Nov 24, 2010 at 6:51 AM, Richard Vlasimsky
<richard.vlasimsky at imidex.com> wrote:
>
> Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x="date",by.y="date",all=TRUE,sort=TRUE);
>
> This statement takes about 8 hours to execute on a pretty fast machine.  The dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25 columns.  The dataset data2 contains annual data (only 60 observations), however there are lots of columns (20,000 of them).
>
> I have to do a lot of these kinds of merges so need to figure out a way to speed it up.
>
> I have tried  a number of different things to speed things up to no avail.  I've noticed that rbinds execute much faster using matrices than dataframes.  However the performance improvement when using matrices (vs. data frames) on merges were negligible (8 hours down to 7).  I tried casting my merge field (date) into various different data types (character, factor, date).  This didn't seem to have any effect. I tried the hash package, however, merge couldn't coerce the class into a data.frame.  I've tried various ways to parellelize computation in the past, and found that to be problematic for a variety of reasons (runaway forked processes, doesn't run in a GUI environment, doesn't run on Macs, etc.).
>
> I'm starting to run out of ideas, anyone?  Merging a 60 row dataset shouldn't take that long.
>

I don't know for sure if its faster but you could try using sqldf
since it won't rely on R facilities to do the join.  See the examples
here::

http://code.google.com/p/sqldf/#Example_4._Join

and if that is still too slow try adding indexes as illustrated in
examples 4h and 4i there.

Its very easy to try since its just one sql statement to do a join and
one to index each table if you need that.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list