[R-SIG-Finance] Keeping persistent data collections

Mon Nov 7 10:20:19 CET 2011

On Sun, 2011-11-06 at 22:43 -0500, Dino Veritas wrote:
> Hello, I recently found this list and have been reading deep into the
> archives. I would like to know how people here maintain their collections
> of data for easy use in R. I have a few questions:
> 
> 1) How do members of this list deal with keeping persistent data
> collections with R? I was thinking of individual xts objects by asset and
> frequency (such as AAPL daily, AAPL minute, AAPL 60m, etc). While I can
> store and maintain these xts objects on disk and load them into R as
> needed, I am wondering if there is a better solution.

I store only tick data, as I can easily get to any other frequency from
tick.  I've considered also storing daily data, but in the end I decided
it is too much trouble to (additionally) manage, and just store tick.
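
As a rough illustration of deriving lower frequencies from tick data, the
sketch below uses to.period() from xts on a small made-up series of trade
prices.  The timestamps, prices and object names are hypothetical and are
not taken from our actual data.

```r
library(xts)

# Made-up trade prices: six ticks spread over three minutes (illustrative only).
tick_times <- as.POSIXct("2011-11-07 09:30:00", tz = "UTC") +
  c(5, 20, 50, 65, 110, 130)
ticks <- xts(c(100.0, 100.5, 100.2, 100.4, 100.1, 100.3),
             order.by = tick_times)

# Collapse the ticks into one-minute OHLC bars.
bars_1m <- to.period(ticks, period = "minutes", k = 1)

# The same tick series can be taken to a coarser frequency on demand,
# here one bar per day.
bars_1d <- to.period(ticks, period = "days")
```

Because every bar is computed from the ticks when it is needed, there is no
second copy of the data to keep in sync.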

> 2) Coming from that, I have been looking into the indexing package for my
> needs. It seems very useful for managing a lot of large data sets in
> memory, but I am not sure it is a good method for maintaining persistent
> data. I have had trouble adding information to existing data that is
> indexed on disk. Do posters here use indexing for this purpose? I did find
> an old post or two touching on that with no specifics. I would like to be
> able to combine the ability of indexing to have many large data sets
> available in memory with persistent storage of data. Has anyone any
> experience doing this?

You are correct that the 'indexing' package is very powerful.  It is
also not done yet.  

As I said, I store tick data.  The way I do this is with one file per
day of data per symbol, pre-parsed into xts objects and stored to disk
in one directory per symbol (using 'save').
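
A minimal sketch of that layout, using save() and load() from base R, might
look like the following.  The helper names (day_path, save_day, load_day),
the file naming pattern and the sample prices are assumptions made for
illustration, not a description of our production code.

```r
library(xts)

# Hypothetical layout: <base_dir>/<symbol>/<YYYY.MM.DD>.<symbol>.rda
day_path <- function(symbol, day, base_dir) {
  file.path(base_dir, symbol,
            paste0(format(as.Date(day), "%Y.%m.%d"), ".", symbol, ".rda"))
}

# Save one day of data for one symbol, stored under the symbol's own name.
save_day <- function(x, symbol, day, base_dir) {
  path <- day_path(symbol, day, base_dir)
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  env <- new.env()
  assign(symbol, x, envir = env)
  save(list = symbol, file = path, envir = env)
  invisible(path)
}

# Load one day back and return the stored xts object.
load_day <- function(symbol, day, base_dir) {
  env <- new.env()
  load(day_path(symbol, day, base_dir), envir = env)
  get(symbol, envir = env)
}

# Usage with made-up data in a temporary directory.
base_dir <- file.path(tempdir(), "tick_store")
aapl <- xts(c(399.1, 399.3),
            order.by = as.POSIXct("2011-11-04 14:30:00", tz = "UTC") + c(1, 2))
save_day(aapl, "AAPL", "2011-11-04", base_dir)
restored <- load_day("AAPL", "2011-11-04", base_dir)
```

With one small file per symbol per day, adding a new day never touches the
existing archive, and loading a date range only reads the files in that range.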

I then use FinancialInstrument to keep track of all the instrument
metadata, and getSymbols to load the data into R when I need it (and
over the time-frames that I require).  We currently download tick data
for about 2500 tradeable instruments per day, and maintain archives
going back several years.  We have the .instrument environment stored on
the same file server as the data, and every .Rprofile in the firm points
to it, so that everyone has access to getInstrument and getSymbols.

I know someone who works in the hedge fund industry, mostly with monthly
data, with some daily data sprinkled in.  He uses the same approach I
have outlined of storing the metadata in FinancialInstrument, and
getSymbols to access the data.  He typically stores one consolidated CSV
file per instrument, because CSV files are easy to add on to with a
batch process.  
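
To make the CSV approach concrete, here is a small base R sketch of a batch
step that appends new rows to one consolidated file per instrument.  The
function name append_csv, the file name FUND_A.csv, the column names and the
values are all hypothetical.

```r
# Append a batch of new rows to an instrument's CSV file, writing the
# header only when the file is first created.
append_csv <- function(new_rows, path) {
  is_new <- !file.exists(path)
  write.table(new_rows, file = path, sep = ",", row.names = FALSE,
              col.names = is_new, append = !is_new, quote = FALSE)
  invisible(path)
}

# Start from a clean temporary file so the example is repeatable.
csv_path <- file.path(tempdir(), "FUND_A.csv")
if (file.exists(csv_path)) file.remove(csv_path)

# First batch creates the file; the second batch adds one more month.
append_csv(data.frame(Date = as.Date(c("2011-08-31", "2011-09-30")),
                      Close = c(101.2, 99.8)), csv_path)
append_csv(data.frame(Date = as.Date("2011-10-31"), Close = 103.5), csv_path)

# Read the consolidated file back for use in R.
monthly <- read.csv(csv_path, stringsAsFactors = FALSE)
monthly$Date <- as.Date(monthly$Date)
```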

For lower frequency data (let's say daily or lower) a database is
certainly an option, and there are getSymbols wrappers that could be
adapted to whatever schema you decide to use.  There are also tick
data database providers such as OneTick and kdb.  If you have this
problem and the resources for this type of solution, you probably
already know that you are in this camp, and know that these providers
have R interfaces of varying quality.

The FinancialInstrument package has a 'parsers' directory included in
the 'inst' directory of the package with many examples of download and
parse routines for regular loading of data from a variety of free or
subscription providers.  This should give you a lot of material to begin
working with your own data providers.

> 3) How do people keep track of all the data sets within R? Are there any
> useful packages for keeping track of multiple sets of financial data and
> the information about them?

We wrote and use FinancialInstrument for this purpose.

As I said earlier, I see no value in storing different periodicities,
and store only tick.

One of the reasons that I chose to write a getSymbols wrapper for
retrieving our tick data stores is that resources like this list have
extensive experience with getSymbols, and it is therefore easy for
people at our firm to become familiar with using the data.

Also, I am reasonably confident that as the indexing package matures,
there will be a getSymbols method for it as well, and if appropriate I
can easily convert all my data in one batch pass and it will be
transparent to my users.

At a previous firm I made what I now realize was a mistake: I wrote a
data retrieval function that was not compatible with getSymbols.  It was
harder to teach people to use, and less compatible with the large
amount of publicly available code.

quantmod and FinancialInstrument contain examples of various getSymbols
methods that may meet your needs, or that could serve as templates for
your custom in-house data source.

> 4) Any other pointers? I know many here are well versed and manage large
> data sets with R. Any tips you have, or even simply pointing me in a helpful
> direction to useful packages you use, would be great. This list is a great help
> for me and I am still browsing old threads!

Regards,

    - Brian

-- 
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock