[R] fast or space-efficient lookup?

ivo welch ivo.welch at gmail.com
Sun Oct 9 18:31:08 CEST 2011


Dear R experts---I am struggling with memory and speed issues.  Advice
would be appreciated.

I have a long data set (of financial stock returns, with stock name
and trading day).  All three variables, stock return, id, and day,
are irregular.  The data set is about 1.3GB in object.size (200MB on
disk).  Now, I need to merge the main data set with some aggregate
data from the same day (e.g., the S&P500 market rate of return,
indexed by day).  This "market data set" is small (object.size =
300K, 5 columns, 12,000 rows).

Let's say my (statistically dumb) plan is to run one grand
regression, where the individual rate of return is y and the market
rate of return is x.  The following should work without a problem:

## left join: keep every row of main, attach the market data by day
combined <- merge( main, aggregate.data, by="day", all.x=TRUE, all.y=FALSE )
lm( stockreturn ~ marketreturn, data=combined )

Alas, the merge is neither space-efficient nor fast.  In fact, I run
out of memory on my 16GB Linux machine.  My guess is that I could
make it work by whittling the problem down (perhaps merging in
chunks, and then rbinding the pieces together), but this is painful.
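For concreteness, the chunked variant I have in mind looks roughly
like this (an untested sketch; the 1e6-row chunk size is arbitrary):

## merge in pieces to cap peak memory, then stack the results
rows   <- seq_len(nrow(main))
chunks <- split(rows, ceiling(rows / 1e6))   # arbitrary chunk size
pieces <- lapply(chunks, function(i)
    merge(main[i, ], aggregate.data, by="day", all.x=TRUE, all.y=FALSE))
combined <- do.call(rbind, pieces)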

In Perl, I would define a hash with the day as key and the market
return as value, and then loop over the main data set to supplement
it.
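
The closest analogue I can think of in base R is match(), which does
exactly this kind of keyed lookup and adds just one column to main
instead of building a second full-size copy (again an untested
sketch, reusing the column names from above):

## hash-style lookup: position of each main day in the small table
idx <- match(main$day, aggregate.data$day)   # NA where no match
main$marketreturn <- aggregate.data$marketreturn[idx]
lm(stockreturn ~ marketreturn, data=main)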

Is there a recommended way of doing such tasks in R, either
super-fast (so that I can merge many, many times) or space-efficient
(so that I merge once and store the result)?
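
One candidate I keep seeing mentioned is the data.table package; if I
understand its keyed joins correctly, something like the following
would do the same merge (my own sketch, not benchmarked, and assuming
data.table is installed):

library(data.table)
mainDT <- data.table(main)
aggDT  <- data.table(aggregate.data)
setkey(mainDT, day)   # index both tables on the join key
setkey(aggDT, day)
## aggDT[mainDT]: one row per row of mainDT, market columns attached
combined <- aggDT[mainDT]
lm(stockreturn ~ marketreturn, data=combined)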

sincerely,

/iaw

----
Ivo Welch (ivo.welch at gmail.com)


