[R] Performance tuning tips when working with wide datasets

Mike Marchywka marchywka at hotmail.com
Wed Nov 24 13:34:42 CET 2010

----------------------------------------
> Date: Wed, 24 Nov 2010 13:23:32 +0100
> From: cbeleites at units.it
> To: r-help at r-project.org
> Subject: Re: [R] Performance tuning tips when working with wide datasets
>
> Dear Richard,
>
> > Does anyone have any performance tuning tips when working with datasets that
> > are extremely wide (e.g. 20,000 columns)?
> The obvious one is: use matrices – and take care that they don't get converted
> back to data.frames.
>
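
A quick way to see how big that effect is on your own machine (sizes
below are purely illustrative) is to time row access on the same data
held both ways:

m  <- matrix(rnorm(1e6), ncol = 100)
df <- as.data.frame(m)
system.time(for (i in seq_len(1000)) m[i, ])   # matrix rows: fast
system.time(for (i in seq_len(1000)) df[i, ])  # data.frame rows: much slower
stopifnot(is.matrix(m))   # cheap guard that nothing coerced it back
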
> > In particular, I am trying to perform a merge like below:
> >
> > merged_data<- merge(data1, data2,
> > by.x="date",by.y="date",all=TRUE,sort=TRUE);
> >
> > This statement takes about 8 hours to execute on a pretty fast machine. The
> > dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25
> > columns. The dataset data2 contains annual data (only 60 observations),
> > however there are lots of columns (20,000 of them).
> >
> > I have to do a lot of these kinds of merges so need to figure out a way to
> > speed it up.
> >
> > I have tried a number of different things to speed things up to no avail.
> > I've noticed that rbinds execute much faster using matrices than dataframes.
> > However the performance improvement when using matrices (vs. data frames) on
> > merges were negligible (8 hours down to 7).
> which is astonishing, as merge (matrix) uses merge.default, which boils down to
> merge(as.data.frame(x), as.data.frame(y), ...)
>
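
You can check that in a running R session: printing the default method
shows it is essentially a one-line wrapper, roughly

merge.default
# function (x, y, ...)
# merge(as.data.frame(x), as.data.frame(y), ...)

so any speed difference between the two calls should indeed be small.
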
> > I tried casting my merge field
> > (date) into various different data types (character, factor, date). This
> > didn't seem to have any effect. I tried the hash package, however, merge
> > couldn't coerce the class into a data.frame. I've tried various ways to
> > parallelize computation in the past, and found that to be problematic for a
> > variety of reasons (runaway forked processes, doesn't run in a GUI
> > environment, doesn't run on Macs, etc.).
> >
> > I'm starting to run out of ideas, anyone? Merging a 60 row dataset shouldn't
> > take that long.
>
> Do I understand correctly that the result should be a 20000 x 20025 matrix,
> where the additional 25 columns are from data2 and end up in the rows of e.g.
> every 1st of January?
>
> In that case, you may be much faster producing tmp <- matrix (NA, 20000, 20000),
> fill the values of data2 into the correct rows, and then cbind data1 and tmp.
> Make sure you have enough RAM available: tmp is about 1.5 GB. If you manage to
> do this without swapping, it should be reasonably fast.
>
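
For what it's worth, a rough sketch of that preallocate-and-fill idea,
assuming data1 and data2 are numeric matrices that both carry the date
as a comparable value in a "date" column (all names are illustrative):

ann_cols <- setdiff(colnames(data2), "date")

# pre-allocate the wide annual block: all NA, one row per daily row
tmp <- matrix(NA_real_, nrow = nrow(data1), ncol = length(ann_cols),
              dimnames = list(NULL, ann_cols))

# drop each annual observation onto the daily row with the same date
idx <- match(data2[, "date"], data1[, "date"])
ok  <- !is.na(idx)
tmp[idx[ok], ] <- data2[ok, ann_cols]

# column-bind the daily data and the filled annual block
merged_data <- cbind(data1, tmp)

Note that, unlike merge(..., all = TRUE), this silently drops annual
rows whose date does not occur in data1.
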
> If you end up writing a proper merge function for matrices, please let me know:
> I'd be interested in using it...

Alternatively, if you are starting from raw data files, it may be
easier to do this kind of merge with command-line tools such as awk,
sed, or perl. I would also mention that you need to get some idea of
why it is slow: look at the task manager for CPU usage and page-fault
rates. Even heavy use of virtual memory can be acceptable if you keep
the access pattern coherent. My favorite example is a case where I ran
a shell "sort" on the data before piping it into another application;
you wouldn't think that sorting a large dataset would speed things up,
but it took that application from unusably slow to quite fast because
it stopped the VM thrashing. Also, depending on how you will use your
matrix, if you know your access patterns you may want to stream just
the pieces you need into memory rather than holding the whole thing.

So, first find out which computer resource is slowing your approaches
down, and then design a data structure that fits your analysis needs.
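
Within R itself, Rprof() is the easiest way to see where the time
actually goes, e.g. (the output file name is arbitrary):

Rprof("merge_profile.out")
merged_data <- merge(data1, data2, by = "date", all = TRUE, sort = TRUE)
Rprof(NULL)
summaryRprof("merge_profile.out")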


>
> Claudia
>
>
> > Thanks, Richard
>
> --
> Claudia Beleites
> Dipartimento dei Materiali e delle Risorse Naturali
> Università degli Studi di Trieste
> Via Alfonso Valerio 6/a
> I-34127 Trieste
>
> phone: +39 0 40 5 58-37 68
> email: cbeleites at units.it

