[R-SIG-Finance] high frequency data analysis in R

Jeff Ryan jeff.a.ryan at gmail.com
Thu May 21 19:25:11 CEST 2009


Haky,

My times are from a fresh R session on a 2.16 GHz MacBook, so yes, it is fast.

On Thu, May 21, 2009 at 12:02 PM, Hae Kyung Im <hakyim at gmail.com> wrote:
> Jeff,
>
> This is very impressive. Even on my MacBook Air it takes less than 0.2
> seconds total.
>
>> x <- .xts(1:1e6, 1:1e6)
>> system.time(merge(x,x))
>   user  system elapsed
>  0.093   0.021   0.198
>


>
>> quantmod now has (devel) an attachSymbols function that makes
>> lazy-loading data very easy, so all your data can be stored as xts
>> objects and read in on-demand.
>
> When you say stored, do you mean on disk or in memory?
>

attachSymbols can use disk or memory for caching, but the files are
read with getSymbols, so they can realistically be stored anywhere.
The docs provide at least a small introduction.
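
Roughly, the workflow looks like this (an untested sketch against the
devel quantmod; DDB_Yahoo is its built-in demand database, and SPY is
just an illustrative symbol):

  library(quantmod)

  ## register a demand database on the search path; nothing is
  ## downloaded or loaded at this point
  attachSymbols(DB = DDB_Yahoo())

  ## the first reference to a symbol triggers getSymbols behind the
  ## scenes and caches the result (in memory or on disk)
  head(SPY)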

The tutorial I gave at R/Finance 2009 gives a small example as well.

http://www.RinFinance.com/presentations

>> xts is also getting the ability to query subsets of data on disk, by
>> time.  This will have no practical limit.
>
> This would be great! Will we be able to append data to xts stored on disk?
>

The core design question is whether to optimize for reads or for writes.
I lean toward read optimization, so something akin to a column-based
structure.  That will make writes more costly, but the trade-off is
acceptable to me at the moment.  I will probably keep some sort of
write-structure => read-structure conversion tool in the mix as well.

I will of course keep the list updated on progress here once it is
ready for release.

>
> Thanks
> Haky
>

Thanks,
Jeff
>
>
> On Thu, May 21, 2009 at 11:23 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
>> Not to distract from the underlying processing question, but to answer
>> the 'data' one:
>>
>> The data in R shouldn't be too much of an issue, at least from a size perspective.
>>
>> xts objects on the order of millions of observations are still fast
>> and memory-friendly with respect to the copying operations internal to
>> many xts calls (merge, subset, etc.).
>>
>>> x <- .xts(1:1e6, 1:1e6)
>>> system.time(merge(x,x))
>>   user  system elapsed
>>  0.037   0.015   0.053
>>
>>
>> 7 million obs of a single-column xts is ~54 Mb.  You can certainly
>> handle quite a bit of data if you have anything more than a trivial
>> amount of RAM.
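
As a quick sanity check on that figure (not from the original mail; the
exact number depends on the storage mode of the data and of the index):

  library(xts)
  x <- .xts(1:7e6, 1:7e6)               # 7 million integer observations
  print(object.size(x), units = "Mb")   # data plus index footprint
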
>>
>> quantmod now has (devel) an attachSymbols function that makes
>> lazy-loading data very easy, so all your data can be stored as xts
>> objects and read in on-demand.
>>
>> xts is also getting the ability to query subsets of data on disk, by
>> time.  This will have no practical limit.
>>
>> As for current data handling, xts, fts (C++-based), data.table, and a
>> few other packages should mitigate your problems, if not solve the
>> 'data' side altogether.
>>
>>
>> HTH
>> Jeff
>>
>>
>>
>> On Thu, May 21, 2009 at 11:13 AM, Hae Kyung Im <hakyim at gmail.com> wrote:
>>> I think in general you would need some sort of pre-processing before using R.
>>>
>>> You can use periodic sampling of prices, but you may be throwing away
>>> a lot of information.  This method used to be recommended more than
>>> 5 years ago to mitigate the effect of market noise, at least in the
>>> context of volatility estimation.
>>>
>>> Here is my experience with tick data:
>>>
>>> I used FX data to calculate estimated daily volatility using TSRV
>>> (Zhang et al. 2005,
>>> http://galton.uchicago.edu/~mykland/paperlinks/p1394.pdf).  Using the
>>> time series of estimated daily volatilities, I forecasted volatility
>>> from 1 day up to 1 year ahead.  The tick data was in a Quantitative
>>> Analytics database.  I used their C++ API to query daily data, computed
>>> the TSRV estimator in C++, and saved the results to a text file.  Then
>>> I used R to read the estimated volatilities and a FARIMA model to
>>> forecast volatility.  An interesting feature of this type of series is
>>> that the fractional coefficient is approximately 0.4 in many instances.
>>> Bollerslev has a paper commenting on this fact.
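
For anyone who wants to try that last FARIMA step in R, something along
these lines should work (a rough, untested sketch, not Haky's actual
code; the file name is made up, and it assumes the fracdiff and
forecast packages):

  library(fracdiff)
  library(forecast)

  dvol <- scan("daily_tsrv_vol.txt")   # hypothetical file of daily TSRV estimates
  fit  <- fracdiff(log(dvol), nar = 1, nma = 1)
  fit$d                                # the fractional coefficient discussed above
  plot(forecast(fit, h = 250))         # roughly one year of business days ahead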
>>>
>>> In another project, I had treasury futures market-depth data.  The data
>>> came in plain-text format, with one file per day.  Each day had more
>>> than 1 million entries.  I don't think I could handle this with R
>>> directly.  To get started, I decided to use only the actual trades,
>>> and used Python to extract them.  That brought it down to ~60K entries
>>> per day, which I could handle in R.  I used to.period from the xts
>>> package to aggregate the data.
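
For what it's worth, that to.period step looks roughly like this (a
self-contained sketch on simulated data, not the actual workflow):

  library(xts)

  ## fake one day of ~60K irregularly spaced trade prices
  idx    <- as.POSIXct("2009-05-21 08:30:00") + sort(runif(6e4, 0, 6.5 * 3600))
  trades <- xts(100 + cumsum(rnorm(6e4, sd = 0.01)), order.by = idx)

  ## aggregate to 5-minute OHLC bars
  bars <- to.period(trades, period = "minutes", k = 5)
  head(bars)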
>>>
>>> In order to handle market-depth data, we need some efficient way to
>>> access (query) such a huge database.  I looked a little into kdb, but
>>> you have to pay ~25K to buy the software for one processor.  I haven't
>>> been able to look into it further for now.
>>>
>>> Haky
>>>
>>>
>>>
>>>
>>> On Thu, May 21, 2009 at 10:15 AM, Jeff Ryan <jeff.a.ryan at gmail.com> wrote:
>>>> Not my domain, but you will more than likely have to aggregate to some
>>>> sort of regular/homogeneous type of series for most traditional tools
>>>> to work.
>>>>
>>>> xts has to.period to aggregate up to a lower frequency from tick-level
>>>> data.  Coupled with something like na.locf, you can make yourself some
>>>> high-frequency 'regular' data from 'irregular'.
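
Concretely, that to.period/na.locf combination might look something
like this (an illustrative, untested sketch on simulated ticks):

  library(xts)

  ## fake irregularly spaced tick prices
  idx   <- as.POSIXct("2009-05-21 09:30:00") + cumsum(rexp(5000, rate = 2))
  ticks <- xts(100 + cumsum(rnorm(5000, sd = 0.01)), order.by = idx)

  ## last price in each 1-second bucket, index rounded up to the second
  secs <- align.time(to.period(ticks, period = "seconds", OHLC = FALSE), n = 1)

  ## carry the last observation forward onto a regular 1-second grid
  grid    <- seq(start(secs), end(secs), by = "1 sec")
  regular <- na.locf(merge(secs, xts(, grid)))[grid]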
>>>>
>>>> Regular and irregular of course depend on what you are looking at
>>>> (weekends missing in daily data can still be 'regular').
>>>>
>>>> I'd be interested in hearing thoughts from those who actually tread in
>>>> the high-freq domain...
>>>>
>>>> A wealth of information can be found here:
>>>>
>>>>  http://www.olsen.ch/publications/working-papers/
>>>>
>>>> Jeff
>>>>
>>>> On Thu, May 21, 2009 at 10:04 AM, Michael <comtech.usa at gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> I am wondering whether there are any special packages for handling
>>>>> high-frequency data in R.
>>>>>
>>>>> I have some high-frequency data and was wondering what meaningful
>>>>> experiments I can run on it.
>>>>>
>>>>> I am not sure whether the data analysis tools from standard
>>>>> (low-frequency) financial time series textbooks will work for
>>>>> high-frequency data.
>>>>>
>>>>> Say I compute a correlation between two stocks using the
>>>>> high-frequency data, or fit an ARMA model to one stock; will the
>>>>> results be meaningful?
>>>>>
>>>>> Could anybody point me to a classroom-style treatment or lab-tutorial
>>>>> type of document showing what meaningful experiments/tests I can run
>>>>> on high-frequency data?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>



-- 
Jeffrey Ryan
jeffrey.ryan at insightalgo.com

ia: insight algorithmics
www.insightalgo.com


