[R-SIG-Finance] xts + indexing for tick data?

Thu Aug 5 21:09:45 CEST 2010

I'll answer some of the 'background' on this, but note that it is
still very 'alpha': the interface, backend and even functions may
change, though always with the same end goal of handling external data
as fast and as memory-friendly as possible.

On Thu, Aug 5, 2010 at 12:40 PM, Andrew Piskorski <atp at piskorski.com> wrote:

>
> Interesting!  Could you say a bit more though about how you've
> actually used xts and indexing in practice, and any pointers for
> getting started trying them out for tick data?

The general design is something like a data.frame, but implemented as
an environment.  The internal structure shouldn't be a concern though,
just note that the interface attempts to behave like a data.frame.

Do note that a data.frame isn't much more than a fancy list in R,
which is in turn a collection of atomic vectors, which are in turn
bytes in memory plus a bit of an R header.

> .Internal(inspect(data.frame(a=1)))
@10224c868 19 VECSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=1)
  @10224c838 14 REALSXP g0c1 [] (len=1, tl=1) 1
ATTRIB:
  @102439390 02 LISTSXP g0c0 []
    TAG: @100843e78 01 SYMSXP g0c0 [MARK,NAM(2),gp=0x4000] "names"
    @10224c808 16 STRSXP g0c1 [] (len=1, tl=1)
      @1009a2428 09 CHARSXP g0c1 [MARK,gp=0x21] "a"
    TAG: @100861860 01 SYMSXP g0c0 [MARK,NAM(2),gp=0x4000] "row.names"
    @100f78b58 13 INTSXP g0c1 [] (len=2, tl=1) -2147483648,-1
    TAG: @100844348 01 SYMSXP g0c0 [MARK,gp=0x4000] "class"
    @100f78b28 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @1008c8ad8 09 CHARSXP g0c2 [MARK,gp=0x21,ATT] "data.frame"
      ATTRIB:
	@1008752e8 09 CHARSXP g0c2 [MARK,gp=0x21] "is.loaded"

> .Internal(inspect(list(a=1)))
@100f78ac8 19 VECSXP g0c1 [ATT] (len=1, tl=0)
  @100f78a68 14 REALSXP g0c1 [] (len=1, tl=1) 1
ATTRIB:
  @10243e558 02 LISTSXP g0c0 []
    TAG: @100843e78 01 SYMSXP g0c0 [MARK,NAM(2),gp=0x4000] "names"
    @100f78a98 16 STRSXP g0c1 [] (len=1, tl=1)
      @1009a2428 09 CHARSXP g0c1 [MARK,gp=0x21] "a"

The first 8 lines of each show what the data.frame and list 'look
like' ... more or less identical.
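
The same point can be checked without reading inspect() output. A
minimal base-R sketch (the names df, l and stripped are only for
illustration):

```r
# A data.frame is a list of column vectors plus a few attributes.
df <- data.frame(a = 1)
l  <- list(a = 1)

is.list(df)             # TRUE
typeof(df)              # "list", the same underlying type as l
names(attributes(df))   # the extra pieces: names, class, row.names

# Drop the class and the row.names attribute and compare with the plain list.
stripped <- unclass(df)
attr(stripped, "row.names") <- NULL
identical(stripped, l)
```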

mmap takes the 'bytes' that are the data and allows them to reside on
disk.  The object itself isn't much more than a pointer (well, it is
-- but for usage you can ignore most of the other 'object' stuff.)

You need to extract elements from this object (mmap) to get them into
R as R objects.  This makes sense since you are using mmap to avoid
having the object in memory -- so you need to think in "chunks from
disk".

> m <- as.mmap(1:100)
> m
<mmap:/var/folde...>  (int) int [1:100] 1 2 3 4 5 6 ...
> m[1]
[1] 1
> m[1:4]
[1] 1 2 3 4

Nothing too interesting so far... but that is what is "inside" the
indexing black box.
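
To make "chunks from disk" concrete without depending on the (still
changing) mmap interface, here is a rough base-R sketch of the same
idea using seek() and readBin() on a flat binary file.  This is not
how mmap is implemented, it only shows the access pattern; read_chunk
is a hypothetical helper and the 4-byte integer layout is an
assumption of the example.

```r
# Write 100 integers to a flat binary file: 4 bytes each, no header.
path <- tempfile(fileext = ".bin")
writeBin(1:100, path, size = 4L)

# Hypothetical helper: read elements from..to (1-based) straight from disk,
# without ever loading the whole vector.
read_chunk <- function(path, from, to) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  seek(con, where = (from - 1L) * 4L)   # jump to the first wanted byte
  readBin(con, what = "integer", n = to - from + 1L, size = 4L)
}

read_chunk(path, 1, 4)      # 1 2 3 4
read_chunk(path, 97, 100)   # 97 98 99 100
```

Only the requested elements ever become R objects; the rest stays on
disk.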

Single vectors aren't overly useful for financial data of course, and
the design of this stems from an options database of symbol names,
prices, dates, etc. (end-of-day data, but 70 million rows and 19
columns at the beginning), so vectors are just part of the solution.

The real meat comes from the indexing package.  This is where you can
think back to a data.frame - a collection of columns (albeit with
columns that don't necessarily reside next to each other).

indexing lets you 'index' the columns with a variety of means, and
therefore makes lookups very, very fast.  O(log n) or thereabouts.  I
am not a comp-sci guy, I am a guy with a data problem ;-) .. so don't
kill me on the nomenclature abuse.
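
For a feel of where O(log n) can come from, here is a sketch of one
common approach: sort the column once, keep the sort permutation, and
binary-search the sorted keys at lookup time.  This is an illustration
in plain R, not the indexing package's actual internals; bound, lookup
and the strike column are made up for the example.

```r
set.seed(1)
strike <- sample(c(90, 95, 100, 105, 110), 1e5, replace = TRUE)

# Build the index once: the permutation that sorts the column, plus sorted keys.
ord  <- order(strike)
keys <- strike[ord]

# Binary search: first position in `keys` that is >= value
# (or > value when strict = TRUE).  Takes O(log n) steps.
bound <- function(keys, value, strict = FALSE) {
  lo <- 1L
  hi <- length(keys) + 1L
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L
    below <- if (strict) keys[mid] <= value else keys[mid] < value
    if (below) lo <- mid + 1L else hi <- mid
  }
  lo
}

# Hypothetical helper: original row numbers where the column equals `value`.
lookup <- function(value) {
  first <- bound(keys, value)
  last  <- bound(keys, value, strict = TRUE) - 1L
  if (last < first) integer(0) else ord[first:last]
}

rows <- lookup(100)
```

The cost of building ord is paid once; each lookup after that touches
only a handful of keys rather than scanning the whole column.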

In essence, you can take any type of data that mmap can use and put it
into this indexed environment/db. The 'magic', if there is any, is
that you can now extract data from your column-oriented "database"
using basic R semantics.

Stuff like:

db[a == 100,  a * 9]

db[symbols== "AAPL" & expiry == "2009-02-20", data.frame(osi, bid, ask, dates)]

The latter returns a data.frame, which can be further subset/ordered/etc.
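
As a rough idea of how db[i, j] semantics like these can sit on top of
an environment of columns, here is a toy base-R sketch.  minidb is a
hypothetical stand-in: it keeps everything in memory and does a plain
linear scan, with no index and no mmap, so it shows only the interface
idea and not the real implementation.

```r
# Hypothetical constructor: each named argument becomes a column
# stored as a variable in an environment.
minidb <- function(...) {
  structure(list(env = list2env(list(...), parent = emptyenv())),
            class = "minidb")
}

# db[i, j]: evaluate i against the columns to pick rows,
# then evaluate j against only those rows.
`[.minidb` <- function(x, i, j) {
  cols <- as.list(x$env)
  keep <- which(eval(substitute(i), cols, parent.frame()))
  hits <- lapply(cols, function(col) col[keep])
  eval(substitute(j), hits, parent.frame())
}

db <- minidb(a       = c(100, 5, 100),
             symbols = c("AAPL", "IBM", "AAPL"),
             bid     = c(1.0, 2.0, 1.1))

db[a == 100, a * 9]
db[symbols == "AAPL", data.frame(symbols, bid)]
```

The trick is substitute(): i and j are captured unevaluated, so column
names can be used as if they were ordinary variables.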

All of the searching is using the optimized index internals
automatically, all without using (much) memory.  In fact, due to OS
caching etc, it is pretty trivial to have any number of SMP procs use
the same DB.  The non-R structure of the data on disk (memory) also
makes it easy to share with non-R procs (untested as of yet).  Data
sets of many GB or more are easy to handle, even on a laptop.

My final comment is that this is really bleeding edge stuff.  Almost
to the point where you'd need access to my laptop/workstation/server
to really have the latest and greatest version.  Comments and feedback
are greatly appreciated, and I will keep pushing this to get it
'production' (i.e. CRAN) ready, but for now think of it as a good idea
(?) that hasn't yet been fully vetted. (though it _is_ being used by a
few...)

Here is a link to my talk at useR! 2010. The video parts of the slides
are now links to YouTube videos.

http://www.insightalgo.com/indexing_mmap.pdf

Best,
Jeff

>
> I grabbed the latest code (below) from svn, but although the docs talk
> about data frames, so far it looks to me like indexing and mmap only
> support atomic vectors.  Does that sound right, or am I looking in the
> wrong places?
>
> Also, the docs talk about how to use mmap's struct() to build up a
> row-oriented data store, but, wouldn't a column-oriented store be more
> natural for many uses in R, particularly for large time series of tick
> data?
>
> I guess I could mmap each column of a data frame (or xts object) to a
> separate file, and then stick any necessary additional metadata for
> the object in some ancillary file, but...  What are your thoughts on
> how to do that right?
>
> (Thanks for your advice!)
>
> ------------------------------
>
> R 2.11.1 (Patched), 2010-07-27, svn.rev 52627, x86_64-unknown-linux-gnu
>> require("mmap") ; require("indexing") ; require("xts")
>> data(sample_matrix)
>> sample.xts <- as.xts(sample_matrix, descr='my new xts object')
>> create_index(sample.xts)
> Error in UseMethod("create_index.mmap") :
>  no applicable method for 'create_index.mmap' applied to an object of class "c('xts', 'zoo')"
>
> $ svn info mmap indexing xts | egrep '^(URL|Revision|Last Changed Rev|Last Changed Date):'
> URL: svn://svn.r-forge.r-project.org/svnroot/indexing/pkg/mmap
> Revision: 94
> Last Changed Rev: 94
> Last Changed Date: 2010-06-28 15:13:29 -0400 (Mon, 28 Jun 2010)
> URL: svn://svn.r-forge.r-project.org/svnroot/indexing/pkg/indexing
> Revision: 94
> Last Changed Rev: 87
> Last Changed Date: 2010-05-19 14:16:43 -0400 (Wed, 19 May 2010)
> URL: svn://svn.r-forge.r-project.org/svnroot/xts/pkg
> Revision: 506
> Last Changed Rev: 506
> Last Changed Date: 2010-08-04 14:28:48 -0400 (Wed, 04 Aug 2010)
>
> --
> Andrew Piskorski <atp at piskorski.com>
> http://www.piskorski.com/
>

-- 
Jeffrey Ryan
jeffrey.ryan at insightalgo.com

ia: insight algorithmics
www.insightalgo.com