[R] Performance tuning tips when working with wide datasets

Wed Nov 24 13:23:32 CET 2010

Dear Richard,

> Does anyone have any performance tuning tips when working with datasets that
> are extremely wide (e.g. 20,000 columns)?
The obvious one is: use matrices – and take care that they don't get converted
back to data.frames.
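
A small illustration of how a matrix can get converted back without notice (a toy sketch; the objects m and df are made up for this example): cbind keeps a matrix only as long as no argument is a data.frame.

```r
m  <- matrix(1:6, nrow = 2)          # plain integer matrix
df <- data.frame(extra = c(10, 20))  # a one-column data.frame

# Both arguments are matrices: the result stays a matrix.
stays_matrix <- cbind(m, m)
is.matrix(stays_matrix)

# One argument is a data.frame: cbind dispatches to the data.frame method.
converted <- cbind(m, df)
is.data.frame(converted)

# If that happens, convert back explicitly before continuing.
back <- as.matrix(converted)
is.matrix(back)
```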

> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2,
>                      by.x = "date", by.y = "date", all = TRUE, sort = TRUE)
>
> This statement takes about 8 hours to execute on a pretty fast machine.  The
> dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25
> columns.  The dataset data2 contains annual data (only 60 observations),
> however there are lots of columns (20,000 of them).
>
> I have to do a lot of these kinds of merges so need to figure out a way to
> speed it up.
>
> I have tried a number of different things to speed things up, to no avail.
> I've noticed that rbinds execute much faster using matrices than dataframes.
> However the performance improvement when using matrices (vs. data frames) on
> merges was negligible (8 hours down to 7).
which is astonishing, as merge (matrix) uses merge.default, which boils down to
merge(as.data.frame(x), as.data.frame(y), ...)
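
This is easy to check on a toy example (x and y are made-up matrices, not the real data): merging two matrices returns a data.frame, so the matrix input buys nothing inside merge itself.

```r
# Two small numeric matrices sharing a "date" column (dates coded as numbers here).
x <- cbind(date = 1:3,     a = c(0.1, 0.2, 0.3))
y <- cbind(date = c(1, 3), b = c(10, 30))

# merge() has no matrix method, so the inputs are converted to data.frames.
res <- merge(x, y, by = "date", all = TRUE)

class(res)   # the result is a data.frame, not a matrix
res
```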

>  I tried casting my merge field
> (date) into various different data types (character, factor, date).  This
> didn't seem to have any effect. I tried the hash package, however, merge
> couldn't coerce the class into a data.frame.  I've tried various ways to
> parallelize computation in the past, and found that to be problematic for a
> variety of reasons (runaway forked processes, doesn't run in a GUI
> environment, doesn't run on Macs, etc.).
>
> I'm starting to run out of ideas, anyone?  Merging a 60 row dataset shouldn't
> take that long.

Do I understand correctly that the result should be a 20000 x 20025 matrix,
where the additional 20000 columns are from data2 and are filled only in the
rows of e.g. every 1st of January?

In that case, you may be much faster by creating tmp <- matrix (NA, 20000, 20000),
filling the values of data2 into the correct rows, and then cbinding data1 and
tmp. Make sure you have enough RAM available: tmp is about 1.5 GB as a logical
NA matrix, and about 3 GB once it holds numeric (double) values. If you manage
to do this without swapping, it should be reasonably fast.
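
A minimal sketch of that idea on toy sizes, so that it runs instantly. All names and dimensions here are made up for illustration. It assumes that the dates are kept in separate vectors (a matrix holds only one type), that both data sets are numeric matrices, and that every date in data2 also occurs in data1; dates of data2 that have no match are simply skipped, which differs from all = TRUE.

```r
set.seed(1)

# Toy stand-ins: 3 years of daily data with 3 columns, 3 annual rows with 5 columns.
dates1 <- seq(as.Date("1950-01-01"), as.Date("1952-12-31"), by = "day")
data1  <- matrix(rnorm(length(dates1) * 3), nrow = length(dates1),
                 dimnames = list(NULL, paste0("d", 1:3)))
dates2 <- as.Date(c("1950-01-01", "1951-01-01", "1952-01-01"))
data2  <- matrix(rnorm(length(dates2) * 5), nrow = length(dates2),
                 dimnames = list(NULL, paste0("a", 1:5)))

# Pre-allocate the block for data2; NA_real_ makes it double from the start,
# so filling in numeric values does not trigger a type conversion.
tmp <- matrix(NA_real_, nrow = length(dates1), ncol = ncol(data2),
              dimnames = list(NULL, colnames(data2)))

# Find the row of each annual date among the daily dates and fill those rows.
rows <- match(dates2, dates1)
ok   <- !is.na(rows)                      # skip annual dates without a match
tmp[rows[ok], ] <- data2[ok, , drop = FALSE]

# Put the two blocks side by side; rows are already in the order of dates1.
merged <- cbind(data1, tmp)
dim(merged)
```

With the real data, only the dimensions change; the expensive part is allocating tmp, not the matching.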

If you end up writing a proper merge function for matrices, please let me know:
I'd be interested in using it...

Claudia

> Thanks, Richard

-- 
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbeleites at units.it