[R-SIG-Finance] Aggregating Statistics By Time Interval

Gabor Grothendieck ggrothendieck at gmail.com
Fri Aug 3 16:36:03 CEST 2007


I had omitted fmt and epoch.  tmp is as in your post.

twas <- function(x) {
	y <- data.frame(timediff = diff(x$time), head(x, -1))
	aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
now <- Sys.time()
epoch <- now - as.numeric(now)
fmt <- function(x) format(epoch + x, "%H")
tmp2 <- cbind(tmp, hour = fmt(tmp$time))

z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# three alternatives

# 1
xtabs(timediff ~., z)

# 2
reshape(z, dir = "wide", timevar = "spread", idvar = "hour")

# 3
library(reshape)
cast(melt(z, id = 1:2), hour ~ spread)


On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Try producing it in "long" format using aggregate and then reshaping
> it into "wide" format using xtabs, reshape or the reshape package:
>
> twas <- function(x) {
>        y <- data.frame(timediff = diff(x$time), head(x, -1))
>        aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
> }
> tmp2 <- cbind(tmp, hour = fmt(tmp$time))
> long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
>
> # any one of these three:
>
> xtabs(timediff ~., long)
>
> reshape(long, dir = "wide", timevar = "spread", idvar = "hour")
>
> library(reshape)
> cast(melt(long, id = 1:2), hour ~ spread)
>
>
> On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
> > Hi
> >
> > Sorry, I'm not sure what happened with that last one. Here is a fully
> > contained example (sorry about the line length if this doesnt wrap).
> >
> > tmp <- data.frame(
> > time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),
> > spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
> > )
> >
> > twas <- function (dat)
> > {
> >     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > sum)/sum(diff(dat$time)) * 100)
> > }
> >
> > now <- Sys.time()
> > epoch <- now - as.numeric(now)
> >
> > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> >
> > Cheers
> > Rory
> >
> >
> > On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
> > > I still get no warning.  Please provide complete self contained input
> > > and output.
> > >
> > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > 1185882791,
> > > +  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > +  2e-04, 3e-04))
> > > >
> > > > twas <-
> > > +  function(dat) {
> > > +    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > +  sum)/sum(diff(dat$time)) * 100.0)
> > > + }
> > > > now <- Sys.time()
> > > > epoch <- now - as.numeric(now)
> > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > > z
> > >       1e-04    2e-04
> > > 07 66.66667 33.33333
> > > >
> > > > R.version.string # XP
> > > [1] "R version 2.5.1 (2007-06-27)"
> > >
> > >
> > > On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
> > > > Hi
> > > >
> > > > I have figured out what causes the warning (and recycling), but I am not
> > > > sure how I can fix it. After seeing that it seemed to work for you, I
> > went
> > > > back and tried working with different subsets of the data. I eventually
> > > > found where it occurs - when we get a third unique spread value. To
> > > > reproduce, just change the definition of tmp to be:
> > > >
> > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > 1185882791,
> > > >  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > >  2e-04, 3e-04)) <== Added 3e-04
> > > >
> > > > i.e. I have just changed one of the spread values to be a third value -
> > this
> > > > seems to trigger the warning  "Warning message:number of columns of
> > result
> > > > is not a multiple of vector length (arg 3) in: rbind", and the
> > recycling. I
> > > > tried this on R 2.5.0 and 2.5.1
> > > >
> > > > Can anyone see what I am doing wrong here?
> > > >
> > > > Cheers
> > > > Rory
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
> > > > > Can you provide a reproducible example that exhibits the warning.
> > > > > Redoing it in a more easily reproducible way and using the data
> > > > > in your post gives me no warning
> > > > >
> > > > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > > > 1185882791,
> > > > > + 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > > > + 2e-04, 1e-04))
> > > > > >
> > > > > > twas <-
> > > > > +  function(dat) {
> > > > > +     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > +  sum)/sum(diff(dat$time)) * 100.0)
> > > > > + }
> > > > > > now <- Sys.time()
> > > > > > epoch <- now - as.numeric(now)
> > > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > > > > z
> > > > >       1e-04    2e-04
> > > > > 07 66.66667 33.33333
> > > > > > R.version.string # XP
> > > > > [1] "R version 2.5.1 (2007-06-27)"
> > > > >
> > > > >
> > > > > Here is input:
> > > > >
> > > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > 1185882791,
> > > > > 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > > > 2e-04, 1e-04))
> > > > > twas <-
> > > > > function(dat) {
> > > > >    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > sum)/sum(diff(dat$time)) * 100.0)
> > > > > }
> > > > > now <- Sys.time()
> > > > > epoch <- now - as.numeric(now)
> > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > > > z
> > > > > R.version.string # XP
> > > > >
> > > > >
> > > > >
> > > > > On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
> > > > > > Hi
> > > > > >
> > > > > > I've been wrestling with this a little bit, using the example in the
> > > > email
> > > > > > that Gabor pointed me to as a reference, and I think I have almost
> > got
> > > > what
> > > > > > I want...however its still not quite right.
> > > > > >
> > > > > > I have a variable, tmp, with two dimensions: time and spread:
> > > > > >
> > > > > > > head(tmp$time)
> > > > > > [1] 1185882786 1185882790 1185882791 1185882791 1185882792
> > 1185882795
> > > > > >
> > > > > > > head(tmp$spread)
> > > > > > [1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04
> > > > > > >
> > > > > >
> > > > > > I also have a function that calculates the time-weighted average
> > spread:
> > > > > >
> > > > > > > twas
> > > > > > function(dat) {
> > > > > >   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > > sum)/sum(diff(dat$time)) * 100.0)
> > > > > > }
> > > > > >
> > > > > > I can combine them using as rbind() and by():
> > > > > >
> > > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > > > >
> > > > > > (epoch is just an instance of ISOdatetime)
> > > > > >
> > > > > > This gives me a warning:
> > > > > >
> > > > > > Warning message:
> > > > > > number of columns of result
> > > > > >        is not a multiple of vector length (arg 3) in: rbind(1, "12"
> > = c(
> > > > > > 91.99207541277 , 8.00792458723005), "13" = c(90.1884966797708,
> > > > > >
> > > > > > The output from the above command is almost exactly what I need,
> > apart
> > > > from
> > > > > > the recycling:
> > > > > >
> > > > > >      1e-04     2e-04      3e-04        4e-04
> > > > > > 12 91.99208  8.007925 91.9920754  8.007924587 <== recycled values
> > > > > > 13 90.18850  9.337448  0.4218405  0.052214551
> > > > > > 14 90.59640  9.171417  0.2321811 90.596401668
> > > > > > 15 89.55771 10.194291  0.2343418  0.013661453
> > > > > > ...
> > > > > >
> > > > > > I can just pass this into a barplot() and get a nice visual
> > breakdown of
> > > > > > hourly weighted spreads, *but* I dont know how to get these results
> > > > without
> > > > > > the recycling. Looking at rbind(), it seems that this will
> > automatically
> > > > > > recycle. Does anyone know of a function I could use to get these
> > results
> > > > > > without this problem?
> > > > > >
> > > > > > Cheers
> > > > > > Rory
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 8/1/07, Gabor Grothendieck < ggrothendieck at gmail.com > wrote:
> > > > > > >
> > > > > > > Something similar was just discussed this morning:
> > > > > > >
> > > >
> > https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html
> > > > > > >
> > > > > > >
> > > > > > > On 8/1/07, Rory Winston <rory.winston at gmail.com > wrote:
> > > > > > > > Hi all
> > > > > > > >
> > > > > > > > I have a question about aggegating statistics by time intervals.
> > I
> > > > have
> > > > > > > a
> > > > > > > > data set with 3 columns : time, bid, and ask. Time is specified
> > as a
> > > > > > > > millisecond timestamp since epoch. I would like to compute
> > summary
> > > > > > > > statistics for the data set on an hourly basis. Here is what I
> > have
> > > > > > > tried so
> > > > > > > > far:
> > > > > > > >
> > > > > > > > # Data is in pricedata
> > > > > > > >
> > > > > > > > t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
> > > > > > > > agg <- aggregate(pricedata$spread, list(byhour=format(t, "%Y-%m
> > > > %H")),
> > > > > > > mean)
> > > > > > > >
> > > > > > > > This seems to do what I want - however, what really want to do
> > is
> > > > more
> > > > > > > > specific: I would like to be able to extract a subset of the
> > data
> > > > frame
> > > > > > > > pricedata, and not just the aggregated entries - for instance,
> > > > instead
> > > > > > > of
> > > > > > > > just extracting pricedata$spread by hour, I would like to
> > extract a
> > > > > > > slice of
> > > > > > > > columns, e.g. pricedata$spread and pricedata$time on an hourly
> > > > basis,
> > > > > > > and
> > > > > > > > pass these into a function that can compute a time-weighted
> > average
> > > > > > > spread,
> > > > > > > > for instance. Does anyone know an elegant way to do this? I have
> > a
> > > > > > > feeling
> > > > > > > > zoo may do what I want, but I'm new to zoo ...
> > > > > > > >
> > > > > > > > Cheers
> > > > > > > > Rory
> > > > > > > >
> > > > > > > >        [[alternative HTML version deleted]]
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > R-SIG-Finance at stat.math.ethz.ch mailing list
> > > > > > > >
> > https://stat.ethz.ch/mailman/listinfo/r-sig-finance
> > > > > > > > -- Subscriber-posting only.
> > > > > > > > -- If you want to post, subscribe first.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >        [[alternative HTML version deleted]]
> > > > > >
> > > > > > _______________________________________________
> > > > > > R-SIG-Finance at stat.math.ethz.ch mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/r-sig-finance
> > > > > > -- Subscriber-posting only.
> > > > > > -- If you want to post, subscribe first.
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>



More information about the R-SIG-Finance mailing list