[R] Performance tuning tips when working with wide datasets
Matthew Dowle
mdowle at mdowle.plus.com
Wed Nov 24 17:15:37 CET 2010
Richard,
Try data.table. See the introduction vignette and the
presentations; for example, one slide shows a join to
183,000,000 observations of daily stock prices running in
0.002 seconds.
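For example, something along these lines (an untested sketch;
the tiny tables are dummy stand-ins for your data1 and data2):

library(data.table)

# Dummy stand-ins for data1 (daily) and data2 (annual, wide):
dt1 <- data.table(date = as.Date("1950-01-01") + 0:9, x = rnorm(10))
dt2 <- data.table(date = as.Date(c("1950-01-03", "1950-01-07")), y = 1:2)

setkey(dt1, date)   # sort once; joins then use binary search
setkey(dt2, date)

dt2[dt1]   # all rows of dt1, with dt2's columns attached where dates match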
data.table has fast rolling joins (i.e. fast last observation
carried forward) too. I see you asked about that on
this list on 8 Nov. Also see fast aggregations using 'by'
on a key()-ed in-memory table.
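A sketch of both (again with dummy stand-ins; roll=TRUE is what
does the last-observation-carried-forward):

library(data.table)

daily  <- data.table(date = as.Date("2000-01-01") + 0:799, x = rnorm(800))
annual <- data.table(date = as.Date(c("1999-12-31", "2000-12-31",
                                      "2001-12-31")), y = c(10, 20, 30))
setkey(daily, date)
setkey(annual, date)

# Rolling join: every daily date picks up the most recent annual y (LOCF)
annual[daily, roll = TRUE]

# Fast grouped aggregation using 'by' on the keyed table
daily[, list(mx = mean(x)), by = list(yr = format(date, "%Y"))]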
I wonder if your 20,000 columns are always
populated for all rows. If not, consider collapsing
to a 3-column table (row, col, data) and then
joining to that. You may have that format in your
original data source anyway, in which case you could
skip any step you have already implemented that
expands it to wide. In other words, keeping it narrow
may be an option (much like how a sparse matrix is
stored).
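Roughly like this (untested; a toy 'wide' table stands in for
your 60 x 20,000 one, and the "row" of the triplet is the date):

library(data.table)

wide <- data.frame(date = as.Date(c("1950-12-31", "1951-12-31")),
                   A = c(1, NA), B = c(NA, 4), C = c(5, 6))

# Collapse to (row, col, data) triplets, column-major to match unlist()
narrow <- data.table(
  date  = rep(wide$date, times = ncol(wide) - 1L),
  col   = rep(names(wide)[-1], each = nrow(wide)),
  value = unlist(wide[-1], use.names = FALSE)
)
narrow <- narrow[!is.na(value)]   # dropping the gaps is the sparse saving
setkey(narrow, col, date)

narrow[J("B")]   # binary-search lookup of one series by column name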
Matthew
http://datatable.r-forge.r-project.org/
"Richard Vlasimsky" <richard.vlasimsky at imidex.com> wrote in message
news:2E042129-4430-4C66-9308-A36B761EBBEB at imidex.com...
>
> Does anyone have any performance tuning tips when working with datasets
> that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2,
>                      by.x = "date", by.y = "date", all = TRUE, sort = TRUE)
>
> This statement takes about 8 hours to execute on a pretty fast machine.
> The dataset data1 contains daily data going back to 1950 (20,000 rows) and
> has 25 columns. The dataset data2 contains annual data (only 60
> observations), however there are lots of columns (20,000 of them).
>
> I have to do a lot of these kinds of merges, so I need to figure out a way
> to speed them up.
>
> I have tried a number of different things to speed this up, to no avail.
> I've noticed that rbinds execute much faster on matrices than on data
> frames. However, the performance improvement from using matrices (vs.
> data frames) for merges was negligible (8 hours down to 7). I tried
> casting my merge field (date) to various data types (character, factor,
> date); this didn't seem to have any effect. I tried the hash package;
> however, merge couldn't coerce that class into a data.frame. I've tried
> various ways to parallelize computation in the past, and found that to be
> problematic for a variety of reasons (runaway forked processes, doesn't
> run in a GUI environment, doesn't run on Macs, etc.).
>
> I'm starting to run out of ideas. Anyone? Merging a 60-row dataset
> shouldn't take that long.
>
> Thanks,
> Richard