[R-SIG-Finance] SUMMARY: Reducing an intra-day dataset into one obs per second

Ajay Shah ajayshah at mayin.org
Sun Dec 6 15:24:00 CET 2009


I was faced with the following problem:

  There is a zoo object containing intra-day data, where certain
  columns of information are observed at sub-second resolution. We
  want to reduce it to one record per second, containing the last
  record observed within each second. If there is no information for
  one full second, then an empty record containing NAs should be
  emitted.

For an example of this data, say
  library(zoo)
  load(url("http://www.mayin.org/ajayshah/tmp/demo.rda"))
  options("digits.secs"=6)
  head(b)
  tail(b)
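
If that URL is unreachable, here is a tiny hand-made stand-in with the
same shape (the column names and values below are made up, purely for
illustration):

  # hypothetical ticks: seconds 0 and 3 have several obs, second 2 has none
  tt <- as.POSIXct("2009-12-04 09:55:00", tz = "UTC") +
          c(0.1, 0.4, 0.9, 1.2, 3.0, 3.5)
  toy <- zoo(matrix(rnorm(12), ncol = 2,
                    dimnames = list(NULL, c("bid", "ask"))), tt)
  toy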

The zoo object `b' in demo.rda has 99,298 rows and 86 columns, and
occupies 72 MB of memory. So it's a bulky object - and typical of the
data coming off the top 20 exchanges in the world.

The immense genius of Gabor Grothendieck and Jeff Ryan guided me to
solutions vastly better than mine. An earlier discussion of this same
problem is at
https://stat.ethz.ch/pipermail/r-sig-finance/2009q4/005144.html, but
after that a good chunk of the conversation shifted to email. The
results are summarised here; some analysis of performance follows the
code.

-----------------------------------------------snipsnipsnip------------------
    library(xts)
    
    print(load(url("http://www.mayin.org/ajayshah/tmp/demo.rda")))
    
    soln1 <- function(z) {
      # Note that x[1] - as.numeric(x[1]) is the origin (1970-01-01)
      toNsec <- function(x, Nsec = 1) Nsec * floor(as.numeric(x) / Nsec + 1.e-7) + (x[1] - as.numeric(x[1]))
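      # e.g. (hypothetical) 09:55:00.333 floors to 09:55:00; the 1e-7 fuzz
      # keeps a time stored a hair below a whole second (e.g. 59.999999)
      # from being floored into the previous second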
    
      # aggregate by seconds
      d <- aggregate(z, toNsec, tail, 1)
    
      merge(d, zoo(, seq(start(d), end(d), "sec")))
    }
    
    # We aggregate 1:NROW(z) instead of z
    soln2 <- function(z) {
      # Note that x[1] - as.numeric(x[1]) is the origin (1970-01-01)
      toNsec <- function(x, Nsec = 1) Nsec * floor(as.numeric(x) / Nsec + 1e-7) + (x[1] - as.numeric(x[1]))
    
      # aggregate 1:nrow(z) by seconds - ix picks last among all at same time
      ix <- aggregate(zoo(seq_len(NROW(z))), toNsec(time(z)), tail, 1)
    
      # apply ix and merge onto grid - uses [,] so z must be 2d
      merge(zoo(coredata(z)[coredata(ix), ], time(ix)), zoo(, seq(start(ix), end(ix), "sec")))
    }
    
    # Use as.integer and duplicated instead of toNsec and aggregate.zoo
    soln3 <- function(b) {
    
      # discretize time into one second bins
      st <- start(b)
      time(b) <- as.integer(time(b)+1e-7) + st - as.numeric(st)
    
      ## find index of last value in each one second interval
      ix <- !duplicated(time(b), fromLast = TRUE)
    
      ## merge with grid
      merge(b[ix], zoo(, seq(start(b), end(b), "sec")))
    }
    
    # xts version by Jeff Ryan
    soln4 <- function(z, Nsec) {
      # align.time() shifts each timestamp forward to the next Nsec-second boundary
      bx.1s <- align.time(z, Nsec)
      time(bx.1s) <- time(bx.1s) + 1
      # endpoints() picks the last row within each second; then merge onto the grid
      merge(bx.1s[endpoints(bx.1s,"secs")], seq(start(bx.1s),end(bx.1s),by="secs"))
    }
    
    # Measure performance
    gc(reset=TRUE)
    cost.1 <- system.time({res.1 <- soln1(b)})
    gc(reset=TRUE)
    cost.2 <- system.time({res.2 <- soln2(b)})
    gc(reset=TRUE)
    cost.3 <- system.time({res.3 <- soln3(b)})
    gc(reset=TRUE)
    xts.conversion.cost <- system.time({bx <- as.xts(b)})
    cost.4 <- system.time({res.4 <- soln4(bx,1)})
    gc(reset=TRUE)
    
    cost.1/cost.2
    cost.1/cost.3
    cost.1/cost.4
    cost.1/(cost.4+xts.conversion.cost)
    
    summary <- rbind(cost.1, cost.2, cost.3, cost.4, xts.conversion.cost)[,1:3]
    rownames(summary) <- c("Soln1","Soln2","Soln3", "XTS","XTS Conversion")
    summary
    
    # Verify correctness: all four solutions should agree on values
    all.equal(res.1, res.2, check.attributes = FALSE)
    all.equal(res.2, res.3, check.attributes = FALSE)
    all.equal(unclass(res.3), unclass(res.4), check.attributes = FALSE)

-----------------------------------------------snipsnipsnip------------------
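
As an aside, the heart of soln3 is that !duplicated(..., fromLast=TRUE)
flags the last observation within each one-second bin. A toy
illustration, with made-up times already floored to whole seconds:

  tt <- c(1, 1, 1, 2, 4, 4)
  !duplicated(tt, fromLast = TRUE)
  # FALSE FALSE  TRUE  TRUE FALSE  TRUE   <- TRUE marks the last obs per bin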

Now here's the performance on my machine - a MacBook Pro with an Intel
Core 2 Duo processor running at 2.53 GHz and 4 GB of RAM, so memory
isn't a constraint for this small demo.rda. For realistic problems I am
finding memory to be a HUGE constraint, so efficiency in the use of
memory is a crucial issue.

The cost seen on my machine, in seconds, is:

               user.self sys.self elapsed
Soln1             66.832    4.954  74.551
Soln2              1.566    0.437   2.069
Soln3              0.547    0.483   1.057
XTS                0.316    0.331   0.665
XTS Conversion     0.900    0.816   1.757

What do we see here?

  * Solution 1 seems reasonable but it's horribly slow - 74.55 seconds
    for this toy dataset.

  * Solutions 2 and 3 are all-zoo and are 36x and 70x faster
    respectively. This is great!

  * The xts solution is 112 times faster!

  * But xts conversion is costly, burning 1.757 seconds. If we are
    converting to xts solely for the purpose of getting this done, then
    the gain compared with Solution 1 is only 31x (see the small check
    just after this list). In this case, Solution 3 is the best,
    particularly because it also does not require holding the xts
    object and the zoo object in memory at the same time.
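
As a small check, these speedups come straight from the elapsed column
of the table above:

  elapsed <- c(soln1 = 74.551, soln2 = 2.069, soln3 = 1.057, xts = 0.665)
  round(elapsed["soln1"] / elapsed, 1)  # 1.0  36.0  70.5  112.1
  round(74.551 / (0.665 + 1.757), 1)    # about 31x once conversion is counted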

In short, if you are already using xts, then it is the fastest solution
here - 112x faster than Solution 1 - so xts is clearly doing something
right. But if you are presently a zoo user, then switching to xts
purely for the purpose of doing this one thing is not efficient.

My code embeds gc(reset=TRUE) calls, but I don't fully understand their
effect on these measurements. Efficiency of memory use *is* an
important problem when dealing with these gigantic intra-day datasets,
and it would be great if others could help in thinking more effectively
about memory efficiency using this information.
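
For what it's worth, my understanding (corrections welcome) is that
gc(reset=TRUE) resets R's "max used" counters, so a later gc() call
reports the peak memory consumed in between - a rough way to see each
solution's memory appetite. A minimal sketch (res3 is just an
illustrative name):

  gc(reset = TRUE)   # reset the "max used" Ncells/Vcells counters
  res3 <- soln3(b)
  gc()               # "max used" columns now show the peak since the reset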

I am grateful to all on r-sig-finance who helped me steer towards
this, and particularly Gabor and Jeff who wrote the above code. And of
course, we in the R world are privileged to have outstanding software
systems like zoo and xts to choose from. It's a delight to build stuff
here.

-- 
Ajay Shah                                      http://www.mayin.org/ajayshah  
ajayshah at mayin.org                             http://ajayshahblog.blogspot.com
<*(:-? - wizard who doesn't know the answer.


