[R-SIG-Finance] data.table is on CRAN (higher speed time series joins and more)

Matthew Dowle mdowle at mdowle.plus.com
Fri Apr 3 02:17:36 CEST 2009


Dear all on r-sig-finance,

Since r-help is deluged and you may not be subscribed to r-pkgs, you may have
missed this recent post about data.table, which has been on CRAN since summer
2008:

    http://tolstoy.newcastle.edu.au/R/packages/09/0538.html

A financial data example to add to that: 190 million rows (all non-US daily
closing stock prices since 1986) take 3GB of RAM in a 3 column data.table
(symbol, date, price). Vector scan DT[id=="VOD"] takes 81 seconds; binary
search DT[J("VOD")] takes 0.002 seconds instead. Including US stock prices
(so the whole world in a 500 million row table) still takes under 0.01
seconds, since it's O(log n) search time. That's on 64-bit with enough RAM to
hold the data in memory, which is commonly available these days. If 32-bit
really is your only option, 3GB almost fits, so you could filter by region
first, for example. This is just an example to illustrate anyway. Note that
it's not the memory footprint that is more efficient (it's the same as a
properly used data.frame) but the query methods.
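
To make the two query styles concrete, here is a minimal sketch on a toy
table. The symbols, dates and prices are made up, and the key column is called
symbol here (an assumption, to match the column list above). Timings on 12
rows mean nothing; the point is only the shape of the two calls.

```r
library(data.table)

# Toy stand-in for the price table; all values are invented for illustration.
DT <- data.table(
  symbol = rep(c("BP", "HSBA", "VOD"), each = 4),
  date   = rep(c(20090330L, 20090331L, 20090401L, 20090402L), times = 3),
  price  = c(4.61, 4.70, 4.75, 4.88,
             3.90, 3.95, 4.10, 4.22,
             1.22, 1.23, 1.25, 1.27)
)

# Sort by (symbol, date) and mark those columns as the key.
setkey(DT, symbol, date)

# Vector scan: the == comparison is evaluated against every row, O(n).
scan_result <- DT[symbol == "VOD"]

# Binary search on the key: O(log n) to locate the block of matching rows.
key_result <- DT[J("VOD")]
```

Both calls return the same four VOD rows. As far as I know, later data.table
releases may optimise the == form automatically, so the gap described above is
best read as applying to the 2009 release.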

The package may give programming time benefits even if you don't have compute
time or memory issues. For example, less code is required than lengthy
statements involving select, from and where keywords in SQL (which might be
in strings paste'd together and sent with sqlQuery), or R code peppered with
$'s, which makes your eyes water sometimes (mine anyway) if people write lots
of code that way.
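
As a small illustration of the "peppered with $'s" point, here is the same
filter written both ways. The table and the price threshold are hypothetical;
this is a sketch of the syntax difference, not a benchmark.

```r
library(data.table)

# Hypothetical prices, held once as a data.frame and once as a data.table.
DF <- data.frame(
  symbol = c("VOD", "VOD", "VOD", "BP"),
  date   = c(20090330L, 20090331L, 20090401L, 20090330L),
  price  = c(119.5, 121.0, 122.25, 461.0),
  stringsAsFactors = FALSE
)
DT <- as.data.table(DF)

# data.frame: every column reference needs the DF$ prefix.
df_dates <- DF[DF$symbol == "VOD" & DF$price > 120, "date"]

# data.table: column names are looked up inside the table itself.
dt_dates <- DT[symbol == "VOD" & price > 120, date]
```

Both give the same integer vector of dates; the second simply has less to
read.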

If X and Y are data.tables :
    X[Y] is a fast time series join between them
    X[Y,roll=TRUE] rolls prevailing prices forward directly.
    X[Y,rolltolast=TRUE] same as roll but doesn't roll the last price
        forward after a stock dies
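
A minimal sketch of the three joins on made-up data, assuming both tables are
keyed by (symbol, date) with integer dates. One caveat: rolltolast is the
argument name in the release described here. To my knowledge, later releases
spell the same behaviour as roll=TRUE with rollends=FALSE, and that is what
the last line uses; treat it as an assumption about your installed version.

```r
library(data.table)

# Hypothetical prices: no VOD price on 20090401, and none after 20090402.
X <- data.table(
  symbol = c("VOD", "VOD", "VOD"),
  date   = c(20090330L, 20090331L, 20090402L),
  price  = c(120.5, 121.0, 119.75)
)
setkey(X, symbol, date)

# The dates we want a price for.
Y <- data.table(
  symbol = "VOD",
  date   = c(20090331L, 20090401L, 20090403L)
)
setkey(Y, symbol, date)

# Exact match only: dates missing from X come back with NA price.
exact <- X[Y]

# Prevailing price: the last observation is carried forward, including
# past the final price in X.
rolled <- X[Y, roll = TRUE]

# Carried forward across gaps, but not beyond the last observation
# (the behaviour described for rolltolast above).
to_last <- X[Y, roll = TRUE, rollends = FALSE]
```

Here exact has a price only for 20090331, rolled fills both other dates, and
to_last fills the gap on 20090401 but leaves 20090403 as NA.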

No particular date class is imposed, just as long as storage.mode() is
integer.
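
For instance, base R's Date class is stored as double, so it would need
converting first. Two possible integer encodings, sketched in base R only
(which one suits is your choice; neither is prescribed by the package):

```r
# A Date vector is held as numeric days since 1970-01-01.
d <- as.Date(c("2009-03-30", "2009-03-31"))
mode_date <- storage.mode(d)        # "double"

# Option 1: days since the epoch, as integer.
d_days <- as.integer(d)
mode_days <- storage.mode(d_days)   # "integer"

# Option 2: yyyymmdd as integer, which stays readable when printed.
d_ymd <- as.integer(format(d, "%Y%m%d"))
mode_ymd <- storage.mode(d_ymd)     # "integer"
```

Either form sorts in date order, which is what a keyed join on date needs.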

Nothing stops you using data.tables badly (just like most things in life),
e.g. you can still do vector scans with them. But they're harder than
data.frames to use badly, e.g. a data.table won't allow character row names,
ever, so no chance (famous last words) of ending up with a data.frame bloated
to 10 times the memory - that old chestnut.

As ever, comments and feedback appreciated.

Regards,
Matthew


