[R-SIG-Finance] Discretising intra-day data -- how to get by with less memory?

Jeff Ryan jeff.a.ryan at gmail.com
Fri Nov 27 16:07:23 CET 2009


The three functions that can be found in xts to help here are:

(1) align.time:  (as Brian alluded to)
This simply shifts every timestamp forward to the next multiple of n seconds
(a rough sketch of the arithmetic follows this list).
e.g. align.time(x, n=300)  # 5 minutes

(2) endpoints:
Locates the position of the last observation in each period of k "on" units
e.g. endpoints(x, on="minutes", k=5)  # 5 minutes

(3) merge.xts with a regular time index.
e.g. merge(x, xts(, timeBasedSeq('2009-11-01 08:30/2009-11-01 13:00')))
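
As a rough sketch of what the alignment in (1) amounts to (this mirrors the
ceiling arithmetic used in the quoted thread below; the variable names here
are mine, for illustration only):

tm <- as.POSIXct("2009-11-27 08:48:18")
n  <- 300                                            # 5-minute boundaries
as.POSIXct(n * ceiling(as.numeric(tm) / n), origin = "1970-01-01")
# -> 2009-11-27 08:50:00, the next 5-minute boundary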



A complete example:
> x <- xts(1:10, Sys.time()+1:10*rnorm(10)*60)
> x
                    [,1]
2009-11-27 08:48:18    9
2009-11-27 08:51:03    7
2009-11-27 08:52:13    8
2009-11-27 08:53:10   10
2009-11-27 08:55:25    6
2009-11-27 08:55:56    1
2009-11-27 08:56:02    4
2009-11-27 08:56:44    3
2009-11-27 08:59:24    2
2009-11-27 09:02:46    5

> xa <- align.time(x,60)  # align to end of minutes
> xa
                    [,1]
2009-11-27 08:49:00    9
2009-11-27 08:52:00    7
2009-11-27 08:53:00    8
2009-11-27 08:54:00   10
2009-11-27 08:56:00    6
2009-11-27 08:56:00    1
2009-11-27 08:57:00    4
2009-11-27 08:57:00    3
2009-11-27 09:00:00    2
2009-11-27 09:03:00    5

> xa[endpoints(xa,'minutes')]  # get last obs with unique timestamp
                    [,1]
2009-11-27 08:49:00    9
2009-11-27 08:52:00    7
2009-11-27 08:53:00    8
2009-11-27 08:54:00   10
2009-11-27 08:56:00    1
2009-11-27 08:57:00    3
2009-11-27 09:00:00    2
2009-11-27 09:03:00    5

> # fill in 'regular' time series
> merge(xa[endpoints(xa,'minutes')], xts( ,seq(start(xa),end(xa),by="mins")))
                    xa.endpoints.xa...minutes...
2009-11-27 08:49:00                            9
2009-11-27 08:50:00                           NA
2009-11-27 08:51:00                           NA
2009-11-27 08:52:00                            7
2009-11-27 08:53:00                            8
2009-11-27 08:54:00                           10
2009-11-27 08:55:00                           NA
2009-11-27 08:56:00                            1
2009-11-27 08:57:00                            3
2009-11-27 08:58:00                           NA
2009-11-27 08:59:00                           NA
2009-11-27 09:00:00                            2
2009-11-27 09:01:00                           NA
2009-11-27 09:02:00                           NA
2009-11-27 09:03:00                            5

> # optional fill=na.locf will carry forward the last observation (last trade?)
> merge(xa[endpoints(xa,'minutes')], xts( ,seq(start(xa),end(xa),by="mins")),fill=na.locf)
                    xa.endpoints.xa...minutes...
2009-11-27 08:49:00                            9
2009-11-27 08:50:00                            9
2009-11-27 08:51:00                            9
2009-11-27 08:52:00                            7
2009-11-27 08:53:00                            8
2009-11-27 08:54:00                           10
2009-11-27 08:55:00                           10
2009-11-27 08:56:00                            1
2009-11-27 08:57:00                            3
2009-11-27 08:58:00                            3
2009-11-27 08:59:00                            3
2009-11-27 09:00:00                            2
2009-11-27 09:01:00                            2
2009-11-27 09:02:00                            2
2009-11-27 09:03:00                            5


I didn't test against your solution(s), but this should be very fast
and use as little memory as possible.  endpoints, align.time and
merge.xts have all been heavily optimized for speed and memory.
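
If you want the above packaged like your intraday.discretise(), a minimal
sketch (untested; the function name intraday.discretise.xts and its argument
names are mine, not part of xts) could look like:

intraday.discretise.xts <- function(x, Nseconds, fill = NA) {
  xa   <- align.time(x, Nseconds)                          # shift each obs to the next boundary
  xl   <- xa[endpoints(xa, on = "seconds", k = Nseconds)]  # keep the last obs per boundary
  grid <- xts(, seq(start(xl), end(xl), by = Nseconds))    # empty series with a regular index
  merge(xl, grid, fill = fill)                             # fill = na.locf carries the last obs forward
}

Calling it as intraday.discretise.xts(x, 60, fill = na.locf) should reproduce
the last table above.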

HTH
Jeff

On Fri, Nov 27, 2009 at 7:00 AM, Brian G. Peterson <brian at braverock.com> wrote:
> Brian G. Peterson wrote:
>>
>> Ajay Shah wrote:
>>>
>>> I'm using this function to convert intra-day data into a grid with an
>>> observation each N seconds:
>>>
>>>  # This function consumes "z" a zoo object where timestamps are intraday
>>>  # and a period for discretisation Nseconds.
>>>  # The key ideas are from this thread:
>>>  #    https://stat.ethz.ch/pipermail/r-sig-finance/2009q4/005144.html
>>>  intraday.discretise <- function(z, Nseconds) {
>>>    toNsec <- function(x)
>>>      as.POSIXct(Nseconds * ceiling(as.numeric(x) / Nseconds),
>>>                 origin = "1970-01-01")
>>>    d <- aggregate(z, toNsec, tail, 1)
>>>    # At this point there is one problem: NA records are not created
>>>    # for blocks of time in which there were no records.
>>>
>>>    # To solve this:
>>>    dreg <- as.zoo(as.ts(d))
>>>    class(time(dreg)) <- class(time(d))
>>>
>>>    dreg
>>>  }
>>>
>>> This works correctly but it's incredibly memory-intensive: I'm running
>>> out of memory when running it on some problems.
>>>
>>> Is there a way to write this which would use less RAM?
>>>
>>>
>>
>> Jeff Ryan, Abe Winter, and I came up with an align.time function a few
>> months back:
>>
>> align.time <- function(x, n=30) {
>>   structure(unclass(x) + (n - unclass(x) %% n),
>>             class = c("POSIXt", "POSIXct"))
>> }
>>
>> x is a POSIXct time vector (e.g. the index of an xts object)
>> n is the alignment interval in seconds
>>
>> Regards,
>>
>>  - Brian
>>
> Or, an earlier, slower version:
>
> This works well enough to generate a new index on the output of to.period:
>
> # stamp is POSIXct object, like index(x) of an xts object
> # n is number of seconds to round to, so n=k in to.period
> even_seconds = function(stamp,n=60)
> {
>  tzone = attr(stamp,"tzone")
>  if (is.null(tzone)) { tzone = "" }
>  base = as.POSIXct(strptime( format(stamp,"%Y%m%d"), "%Y%m%d" ),tz=tzone)
>  i = as.numeric(stamp) - as.numeric(base)
>  i = base + n*ceiling(i/n)
>  i
> }
>
>
>
> --
> Brian G. Peterson
> http://braverock.com/brian/
> Ph: 773-459-4973
> IM: bgpbraverock
>



-- 
Jeffrey Ryan
jeffrey.ryan at insightalgo.com

ia: insight algorithmics
www.insightalgo.com


