[R-SIG-Finance] SUMMARY: Reducing an intra-day dataset into one obs per second
Ajay Shah
ajayshah at mayin.org
Sun Dec 6 15:24:00 CET 2009
I was faced with the following problem:
There is a zoo object containing intra-day data, where certain
columns of information are observed at sub-second resolution. We
want to reduce it to one record per second, containing the last
record observed within each second. If no observation falls within a
given second, then an empty record containing NAs should be emitted.
For an example of this data, say
library(zoo)
load(url("http://www.mayin.org/ajayshah/tmp/demo.rda"))
options("digits.secs"=6)
head(b)
tail(b)
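To make the target concrete, here is a tiny made-up illustration (not
the demo.rda data) of the reduction I'm after:

# Hypothetical toy series: two ticks in the first second, none in the
# next, and one two seconds after the start.
tt  <- as.POSIXct("2009-12-04 10:00:00", tz = "GMT") + c(0.2, 0.7, 2.3)
toy <- zoo(cbind(price = c(100, 101, 102)), tt)
# Desired result: one row per second -- 10:00:00 carries 101 (the last
# tick within that second), 10:00:01 is all NAs (nothing observed), and
# 10:00:02 carries 102.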
The zoo object `b' in demo.rda happens to have 99,298 rows and 86
columns, and occupies 72 megabytes of core. So it's a bulky object, and
it is typical of the data generated on the top 20 exchanges in the world.
The immense genius of Gabor Grothendieck and Jeff Ryan guided me to
solutions that were vastly better than mine. At
https://stat.ethz.ch/pipermail/r-sig-finance/2009q4/005144.html is an
earlier discussion of this same problem, but after that much of the
conversation shifted to email. The results are summarised here; some
analysis of performance follows the code.
-----------------------------------------------snipsnipsnip------------------
library(xts)
print(load(url("http://www.mayin.org/ajayshah/tmp/demo.rda")))
soln1 <- function(z) {
  # Note that x[1] - as.numeric(x[1]) is the origin (1970-01-01)
  toNsec <- function(x, Nsec = 1)
    Nsec * floor(as.numeric(x) / Nsec + 1e-7) + (x[1] - as.numeric(x[1]))
  # aggregate by seconds, keeping the last observation within each second
  d <- aggregate(z, toNsec, tail, 1)
  # merge onto a regular one-second grid so empty seconds become NA rows
  merge(d, zoo(, seq(start(d), end(d), "sec")))
}
# We aggregate 1:NROW(z) instead of z
soln2 <- function(z) {
  # Note that x[1] - as.numeric(x[1]) is the origin (1970-01-01)
  toNsec <- function(x, Nsec = 1)
    Nsec * floor(as.numeric(x) / Nsec + 1e-7) + (x[1] - as.numeric(x[1]))
  # aggregate 1:NROW(z) by seconds - ix picks last among all at same time
  ix <- aggregate(zoo(seq_len(NROW(z))), toNsec(time(z)), tail, 1)
  # apply ix and merge onto grid - uses [,] so z must be 2d
  merge(zoo(coredata(z)[coredata(ix), ], time(ix)),
        zoo(, seq(start(ix), end(ix), "sec")))
}
# Use as.integer and duplicated instead of toNsec and aggregate.zoo
soln3 <- function(b) {
  # discretize time into one second bins
  st <- start(b)
  time(b) <- as.integer(time(b) + 1e-7) + st - as.numeric(st)
  ## find index of last value in each one second interval
  ix <- !duplicated(time(b), fromLast = TRUE)
  ## merge with grid
  merge(b[ix], zoo(, seq(start(b), end(b), "sec")))
}
# xts version by Jeff Ryan
soln4 <- function(z, Nsec) {
  # round each timestamp up to its Nsec-second boundary
  bx.1s <- align.time(z, Nsec)
  time(bx.1s) <- time(bx.1s) + 1
  # keep the last observation per second and merge onto a one-second grid
  merge(bx.1s[endpoints(bx.1s, "secs")],
        seq(start(bx.1s), end(bx.1s), by = "secs"))
}
# Measure performance
gc(reset=TRUE)
cost.1 <- system.time({res.1 <- soln1(b)})
gc(reset=TRUE)
cost.2 <- system.time({res.2 <- soln2(b)})
gc(reset=TRUE)
cost.3 <- system.time({res.3 <- soln3(b)})
gc(reset=TRUE)
xts.conversion.cost <- system.time({bx <- as.xts(b)})
cost.4 <- system.time({res.4 <- soln4(bx,1)})
gc(reset=TRUE)
cost.1/cost.2
cost.1/cost.3
cost.1/cost.4
cost.1/(cost.4+xts.conversion.cost)
summary <- rbind(cost.1, cost.2, cost.3, cost.4, xts.conversion.cost)[,1:3]
rownames(summary) <- c("Soln1","Soln2","Soln3", "XTS","XTS Conversion")
summary
# Verify correctness
nc <- ncol(b)
all.equal(res.1, res.2, check.attributes = FALSE)
all.equal(res.2, res.3, check.attributes = FALSE)
all.equal(unclass(res.3), unclass(res.4), check.attributes = FALSE)
-----------------------------------------------snipsnipsnip------------------
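As far as I can tell, what makes Solutions 2 and 3 fast is that they
avoid aggregating the full 86-column object second by second: Solution
2 aggregates only a row index, and Solution 3 marks the last row in
each second with a single vectorised duplicated() call. A minimal
sketch of that last idea, on a made-up time vector rather than the code
above:

# Discretise timestamps to whole seconds, then keep only the last row
# observed in each second.
tt   <- as.POSIXct("2009-12-04 10:00:00", tz = "GMT") +
        c(0.1, 0.4, 0.9, 2.2, 2.8)
secs <- as.integer(tt + 1e-7)               # whole-second bins
keep <- !duplicated(secs, fromLast = TRUE)  # TRUE for last row per second
which(keep)                                 # here: rows 3 and 5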
Now here's the performance on my machine, a MacBook Pro with an Intel
Core 2 Duo processor running at 2.53 GHz. There is 4 GB of RAM, so
memory isn't a constraint for this small demo.rda. For realistic
problems I am finding memory to be a HUGE constraint, so efficiency in
use of memory is a crucial issue.
The cost seen on my machine (all times in seconds) is:
                user.self  sys.self  elapsed
Soln1              66.832     4.954   74.551
Soln2               1.566     0.437    2.069
Soln3               0.547     0.483    1.057
XTS                 0.316     0.331    0.665
XTS Conversion      0.900     0.816    1.757
What do we see here?
* Solution 1 seems reasonable but it's horribly slow - 74.55 seconds
for this toy dataset.
* Solutions 2 and 3 are all-zoo and are 36x and 70x faster
respectively. This is great!
* The xts solution is 112 times faster!
* But the xts conversion is costly, burning 1.757 seconds. If we are
converting to xts solely for the purpose of getting this done, then the
gain compared with Solution 1 is only about 31x (arithmetic below). In
this case, Solution 3 is the best, particularly because it does not
require memory to house the xts object and the zoo object in core at
the same time.
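To spell out the arithmetic behind that 31x figure, using the elapsed
times from the table above:

# Solution 1 elapsed time divided by (xts run + xts conversion)
74.551 / (0.665 + 1.757)   # roughly 31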
In short, if you are already using xts, then the xts approach is the
fastest solution here (112x faster than Solution 1); xts is clearly
doing something right. But if you are presently a zoo user, then
converting to xts purely for the purpose of doing this one thing is not
efficient.
My code embeds gc(reset=TRUE) calls but I don't fully understand how to
interpret their output. Efficiency of memory use *is* an important
problem when dealing with these gigantic intra-day datasets, and it
would be great if others can help in thinking more effectively about
memory efficiency using this information.
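For what it's worth, my current (possibly wrong) reading of ?gc is that
reset=TRUE clears the "max used" high-water marks, so bracketing a
computation like this should report its peak Ncells/Vcells usage:

gc(reset = TRUE)            # clear the "max used" high-water marks
res.3 <- soln3(b)           # run the computation of interest
gc()[, "max used"]          # peak Ncells/Vcells used since the reset

Corrections on this would be most welcome.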
I am grateful to all on r-sig-finance who helped steer me towards
this, and particularly Gabor and Jeff who wrote the above code. And of
course, we in the R world are privileged to have outstanding software
systems like zoo and xts to choose from. It's a delight to build stuff
here.
--
Ajay Shah http://www.mayin.org/ajayshah
ajayshah at mayin.org http://ajayshahblog.blogspot.com
<*(:-? - wizard who doesn't know the answer.