[R-SIG-Finance] R + HDF5 + Pytables

jeff.a.ryan at gmail.com jeff.a.ryan at gmail.com
Tue May 18 18:04:41 CEST 2010


Yes, indexing is my answer to an out-of-core data solution. 

It is essentially a data.frame on disk, where OS level mmap is used to manage the process efficiently and transparently. 
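To illustrate the general idea of a column stored on disk and read through mmap, here is a minimal sketch. It is not the indexing package's code or file format: the flat float64 layout, the file name price.bin, and the helpers write_column and read_slice are all assumptions made up for this example.

```python
import mmap
import os
import struct
import tempfile

def write_column(path, values):
    """Write a list of floats as a flat binary column of little-endian float64."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%dd" % len(values), *values))

def read_slice(path, start, stop):
    """Return rows [start, stop) of the column.

    The file is memory-mapped, so the OS pages in only the bytes that are
    actually touched instead of loading the whole column.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n = stop - start
            # Each float64 is 8 bytes, so row i starts at byte offset i * 8.
            return list(struct.unpack_from("<%dd" % n, mm, start * 8))

# Hypothetical example data: one "price" column in its own file.
tmpdir = tempfile.mkdtemp()
price_path = os.path.join(tmpdir, "price.bin")
write_column(price_path, [10.0, 10.5, 11.0, 10.75, 11.25])

middle = read_slice(price_path, 1, 4)
```

With one file per column, a query that needs only some columns never reads the others, which is the usual argument for a column-oriented layout.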

It is under very active development and really is quite far from "stable", though it is functional and being used internally on data that is fairly large.

The parts that make it fast are that it is column oriented (like data.frames) and that traditional indexing tools are available by default. Currently sorted indexing is implemented, but bitmap variants are in the works, as well as compression tools for bitmaps. The LZO algorithm is part of the development, as are higher-performance variants based on more advanced compression schemes that allow relational algebra on the compressed bitmaps.
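For readers unfamiliar with the two index types mentioned above, the following is a rough sketch of a sorted index and an uncompressed bitmap index over in-memory columns. It only illustrates the techniques in general; the function names and sample data are invented here, and nothing below reflects how the indexing package actually implements them (compression is left out entirely).

```python
from bisect import bisect_left, bisect_right

def build_sorted_index(column):
    """Return (sorted_values, row_ids): the values in order and the row each came from."""
    row_ids = sorted(range(len(column)), key=column.__getitem__)
    return [column[i] for i in row_ids], row_ids

def range_query(index, lo, hi):
    """Rows whose value v satisfies lo <= v <= hi, located by binary search."""
    sorted_values, row_ids = index
    left = bisect_left(sorted_values, lo)
    right = bisect_right(sorted_values, hi)
    return sorted(row_ids[left:right])

def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def bitmap_rows(bitmap):
    """Expand a bitmap back into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical sample data: two columns of the same table.
symbol = ["GOOG", "AAPL", "GOOG", "MSFT", "AAPL", "GOOG"]
price = [510.0, 250.0, 515.0, 29.0, 252.0, 508.0]

# Sorted index: a range predicate becomes two binary searches.
price_index = build_sorted_index(price)
rows_in_range = range_query(price_index, 250.0, 510.0)

# Bitmap index: combining predicates becomes a bitwise AND.
symbol_bitmaps = build_bitmap_index(symbol)
price_bitmap = sum(1 << r for r in rows_in_range)
goog_in_range = bitmap_rows(symbol_bitmaps["GOOG"] & price_bitmap)
```

The bitwise AND at the end is the simplest case of the "relational algebra on bitmaps" idea: each predicate yields a bitmap, and AND/OR/NOT combine them without touching the underlying rows.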

As Daniel stated though this isn't really 'finance' per se, so I'll stop here. 

When the progress is further along, I will make announcements to the list(s). I am also presenting this at useR in DC this summer. 

Benchmarks against the Kdb's of the world would indeed be fun. I don't think they allow that... I wonder why? ;-)

Jeff

-----Original Message-----
From: Daniel Cegiełka <daniel.cegielka at gmail.com>
Date: Tue, 18 May 2010 16:23:36 
To: Manoj<manojsw at gmail.com>
Cc: <r-sig-finance at stat.math.ethz.ch>
Subject: Re: [R-SIG-Finance] R + HDF5 + Pytables

Manoj, this is not a financial subject; you should send this to
the r-sig-hpc list.

> Hopefully we could do a comparison/benchmarking of a few different
> alternatives (including commercial tools like kdb).

Indexing is still under development, but the ability to work at high
performance with TB of tick data was one of the primary design goals of
the indexing package. Inside the xts code you can find nicely optimized C
code for low latency and high performance. And when you join xts with the
indexing package you can compare it even with kdb... (next point: you
can use indexing as shared memory for many R instances).

Indexing will work nicely even with many TB of tick data, and you don't
have the latency of the TCP stack (as with kdb).

It only needs a good compression solution...

regards,
daniel


On 18 May 2010 05:56, Manoj <manojsw at gmail.com> wrote:
> Daniel - that's interesting feedback.
>
> Jeff: I did a quick search on the indexing package and it seems it's still
> in the development stages; it looks very promising though. I am more than
> happy to test it out and give feedback/suggestions.
>
> Hopefully we could do a comparison/benchmarking of a few different
> alternatives (including commercial tools like kdb).
>
> Manoj
>
> 2010/5/18 Daniel Cegiełka <daniel.cegielka at gmail.com>:
>> Hi Manoj
>> I tested hdf5 with R, and in my opinion it makes no sense to use it
>> with xts/zoo for tick data.
>> If you work in R, it is much better to store xts objects (or
>> R objects) directly on disk (it's simpler, faster, and better).
>>
>> Check Jeff Ryan's packages:
>> RBerkeley: https://r-forge.r-project.org/projects/rberkeley/
>> indexing: http://r-forge.r-project.org/projects/indexing/
>>
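>> Example for RBerkeley:
>>
>> library(RBerkeley)
>> library(xts)                      # needed for the date-range subsetting below
>>
>> bdb <- db_create()
>> db_open(bdb, file='blotter.db')   # open the database file on disk
>>
>> # and a query: fetch the object stored under 'GOOG', then subset it
>> unserialize(db_get(bdb, key='GOOG'))['2010-02-17::2010-02-25', 4]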
>>
>>
>> If you need an ultra-fast solution, you should try Jeff's indexing package ;)
>>
>> regards,
>> daniel
>>
>>
>>
>>
>> 2010/5/17 Manoj <manojsw at gmail.com>
>>>
>>> Dear All,
>>>       I have created an HDF5 file using Python + Pytables. The HDF5
>>> file stores tick data and as such is quite large. I am planning
>>> to use the R/zoo/xts combination for analytics. The tricky bit is that I
>>> am unable to find a good wrapper to access/query the HDF5 file created by
>>> Pytables (keeping intact all the nice features of the HDF5 file, such as
>>> indices). The hdf5 library in R wouldn't help given the size of
>>> the file.
>>>
>>>      One (crude) option is to query data using Python/Pytables, write
>>> to an output file and invoke R for analytics. The question is - could
>>> this task be done in a more efficient fashion? Is there a good
>>> HDF5/Pytables wrapper that could help me do the task completely within
>>> R?
>>>
>>>     Any tips/suggestions would be greatly appreciated.
>>>
>>> Thanks.
>>>
>>> Manoj
>>>
>>> _______________________________________________
>>> R-SIG-Finance at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-finance
>>> -- Subscriber-posting only. If you want to post, subscribe first.
>>> -- Also note that this is not the r-help list where general R questions should go.
>>
>

