[R-SIG-Finance] Aggregating Statistics By Time Interval

Fri Aug 3 19:15:08 CEST 2007

Here is one more solution.  Using the tmp from your post, this one uses
SQLite via the sqldf package.  It produces a similar output as z from
our prior solution and then
we can use xtabs, reshape or the reshape package as before to get
the final layout.  The first subselect within the main select sums within
hour and spread and the second sums within hour.  We join the subselects
and take the ratio of the two the sums to get the answer.

In the solutions before we used hour relative to GMT rather than local time.

> library(sqldf)
> sqldf("select ahour, spread,
+   100 * aa.timediff / bb.timediff timediff from
+   (select
+     strftime('%H',a.time__1,'unixepoch') ahour,
+     strftime('%H',b.time__1,'unixepoch') bhour,
+     a.spread spread,
+     sum(b.time__1 - a.time__1) timediff
+    from tmp a, tmp b
+    where a.row_names = b.row_names-1 and ahour = bhour
+    group by ahour, a.spread) aa join
+    (select strftime('%H',c.time__1,'unixepoch') chour,
+       strftime('%H',d.time__1,'unixepoch') dhour,
+       sum(d.time__1 - c.time__1) timediff
+       from tmp c, tmp d
+       where c.row_names = d.row_names-1 and chour = dhour
+       group by chour) bb
+     where ahour = chour
+     group by spread, ahour",
+  row.names = TRUE)
  ahour spread  timediff
1    11  1e-04 91.358025
2    12  1e-04 92.613636
3    11  2e-04  8.641975
4    12  2e-04  5.681818
5    12  3e-04  1.704545

> # old solution for comparison
> twas <- function(x) {
+ y <- data.frame(timediff = diff(x$time), head(x, -1))
+ aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
+ }
> now <- Sys.time()
> epoch <- now - as.numeric(now)
> fmt <- function(x) format(epoch + x, "%H", tz = "GMT")
> tmp2 <- cbind(tmp, hour = fmt(tmp$time))
>
> z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
> z
     hour spread  timediff
11.1   11  1e-04 91.358025
11.2   11  2e-04  8.641975
12.1   12  1e-04 92.613636
12.2   12  2e-04  5.681818
12.3   12  3e-04  1.704545

Here is input:

library(sqldf)
sqldf("select ahour, spread,
	100 * aa.timediff / bb.timediff timediff from
	(select
	strftime('%H',a.time__1,'unixepoch') ahour,
	strftime('%H',b.time__1,'unixepoch') bhour,
	a.spread spread,
	sum(b.time__1 - a.time__1) timediff
	from tmp a, tmp b
	where a.row_names = b.row_names-1 and ahour = bhour
	group by ahour, a.spread) aa join
	(select strftime('%H',c.time__1,'unixepoch') chour,
		strftime('%H',d.time__1,'unixepoch') dhour,
		sum(d.time__1 - c.time__1) timediff
		from tmp c, tmp d
		where c.row_names = d.row_names-1 and chour = dhour
		group by chour) bb
	where ahour = chour
	group by spread, ahour",
	row.names = TRUE)

On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> The different invocations of twas were creating data frames of different
> numbers of columns because different hours had different numbers
> of spreads.  The warning came when it tried to rbind together
> data.frames with different numbers of columns.
>
> On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
> > Wow....thats great. Thank you very much! I appreciate the help greatly. I
> > dont quite understand what the issue was though....was it that the data
> > frame returned from my initial twas() function was of the wrong order?
> >
> >
> > Cheers
> > Rory
> >
> > On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > > I had omitted fmt and epoch.  tmp is as in your post.
> > >
> > > twas <- function(x) {
> > >         y <- data.frame(timediff = diff(x$time), head(x, -1))
> > >         aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
> > > }
> > > now <- Sys.time()
> > > epoch <- now - as.numeric(now)
> > > fmt <- function(x) format(epoch + x, "%H")
> > > tmp2 <- cbind(tmp, hour = fmt(tmp$time))
> > >
> > > z <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
> > >
> > > # three alternatives
> > >
> > > # 1
> > > xtabs(timediff ~., z)
> > >
> > > # 2
> > > reshape(z, dir = "wide", timevar = "spread", idvar = "hour")
> > >
> > > # 3
> > > library(reshape)
> > > cast(melt(z, id = 1:2), hour ~ spread)
> > >
> > >
> > > On 8/3/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > > > Try producing it in "long" format using aggregate and then reshaping
> > > > it into "wide" format using xtabs, reshape or the reshape package:
> > > >
> > > > twas <- function(x) {
> > > >        y <- data.frame(timediff = diff(x$time), head(x, -1))
> > > >        aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
> > > > }
> > > > tmp2 <- cbind(tmp, hour = fmt(tmp$time))
> > > > long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))
> > > >
> > > > # any one of these three:
> > > >
> > > > xtabs(timediff ~., long)
> > > >
> > > > reshape(long, dir = "wide", timevar = "spread", idvar = "hour")
> > > >
> > > > library(reshape)
> > > > cast(melt(long, id = 1:2), hour ~ spread)
> > > >
> > > >
> > > > On 8/3/07, Rory Winston < rory.winston at gmail.com> wrote:
> > > > > Hi
> > > > >
> > > > > Sorry, I'm not sure what happened with that last one. Here is a fully
> > > > > contained example (sorry about the line length if this doesnt wrap).
> > > > >
> > > > > tmp <- data.frame(
> > > > >
> > time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),
> > > > >
> > spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
> > > > > )
> > > > >
> > > > > twas <- function (dat)
> > > > > {
> > > > >     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > sum)/sum(diff(dat$time)) * 100)
> > > > > }
> > > > >
> > > > > now <- Sys.time()
> > > > > epoch <- now - as.numeric(now)
> > > > >
> > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > > >
> > > > > Cheers
> > > > > Rory
> > > > >
> > > > >
> > > > > On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
> > > > > > I still get no warning.  Please provide complete self contained
> > input
> > > > > > and output.
> > > > > >
> > > > > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > > > > 1185882791,
> > > > > > +  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > > > > +  2e-04, 3e-04))
> > > > > > >
> > > > > > > twas <-
> > > > > > +  function(dat) {
> > > > > > +    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > > +  sum)/sum(diff(dat$time)) * 100.0)
> > > > > > + }
> > > > > > > now <- Sys.time()
> > > > > > > epoch <- now - as.numeric(now)
> > > > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
> > twas))
> > > > > > > z
> > > > > >       1e-04    2e-04
> > > > > > 07 66.66667 33.33333
> > > > > > >
> > > > > > > R.version.string # XP
> > > > > > [1] "R version 2.5.1 (2007-06-27)"
> > > > > >
> > > > > >
> > > > > > On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
> > > > > > > Hi
> > > > > > >
> > > > > > > I have figured out what causes the warning (and recycling), but I
> > am not
> > > > > > > sure how I can fix it. After seeing that it seemed to work for
> > you, I
> > > > > went
> > > > > > > back and tried working with different subsets of the data. I
> > eventually
> > > > > > > found where it occurs - when we get a third unique spread value.
> > To
> > > > > > > reproduce, just change the definition of tmp to be:
> > > > > > >
> > > > > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > > > > 1185882791,
> > > > > > >  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > > > > >  2e-04, 3e-04)) <== Added 3e-04
> > > > > > >
> > > > > > > i.e. I have just changed one of the spread values to be a third
> > value -
> > > > > this
> > > > > > > seems to trigger the warning  "Warning message:number of columns
> > of
> > > > > result
> > > > > > > is not a multiple of vector length (arg 3) in: rbind", and the
> > > > > recycling. I
> > > > > > > tried this on R 2.5.0 and 2.5.1
> > > > > > >
> > > > > > > Can anyone see what I am doing wrong here?
> > > > > > >
> > > > > > > Cheers
> > > > > > > Rory
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
> > > > > > > > Can you provide a reproducible example that exhibits the
> > warning.
> > > > > > > > Redoing it in a more easily reproducible way and using the data
> > > > > > > > in your post gives me no warning
> > > > > > > >
> > > > > > > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > > > > > > 1185882791,
> > > > > > > > + 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04,
> > 1e-04,
> > > > > > > > + 2e-04, 1e-04))
> > > > > > > > >
> > > > > > > > > twas <-
> > > > > > > > +  function(dat) {
> > > > > > > > +     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > > > > +  sum)/sum(diff(dat$time)) * 100.0)
> > > > > > > > + }
> > > > > > > > > now <- Sys.time()
> > > > > > > > > epoch <- now - as.numeric(now)
> > > > > > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
> > twas))
> > > > > > > > > z
> > > > > > > >       1e-04    2e-04
> > > > > > > > 07 66.66667 33.33333
> > > > > > > > > R.version.string # XP
> > > > > > > > [1] "R version 2.5.1 (2007-06-27)"
> > > > > > > >
> > > > > > > >
> > > > > > > > Here is input:
> > > > > > > >
> > > > > > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > > > > 1185882791,
> > > > > > > > 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > > > > > > 2e-04, 1e-04))
> > > > > > > > twas <-
> > > > > > > > function(dat) {
> > > > > > > >    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > > > > sum)/sum(diff(dat$time)) * 100.0)
> > > > > > > > }
> > > > > > > > now <- Sys.time ()
> > > > > > > > epoch <- now - as.numeric(now)
> > > > > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
> > twas))
> > > > > > > > z
> > > > > > > > R.version.string # XP
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 8/3/07, Rory Winston <rory.winston at gmail.com > wrote:
> > > > > > > > > Hi
> > > > > > > > >
> > > > > > > > > I've been wrestling with this a little bit, using the example
> > in the
> > > > > > > email
> > > > > > > > > that Gabor pointed me to as a reference, and I think I have
> > almost
> > > > > got
> > > > > > > what
> > > > > > > > > I want...however its still not quite right.
> > > > > > > > >
> > > > > > > > > I have a variable, tmp, with two dimensions: time and spread:
> > > > > > > > >
> > > > > > > > > > head(tmp$time)
> > > > > > > > > [1] 1185882786 1185882790 1185882791 1185882791 1185882792
> > > > > 1185882795
> > > > > > > > >
> > > > > > > > > > head(tmp$spread)
> > > > > > > > > [1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I also have a function that calculates the time-weighted
> > average
> > > > > spread:
> > > > > > > > >
> > > > > > > > > > twas
> > > > > > > > > function(dat) {
> > > > > > > > >   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > > > > > sum)/sum(diff(dat$time)) * 100.0)
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > I can combine them using as rbind() and by():
> > > > > > > > >
> > > > > > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"),
> > twas))
> > > > > > > > >
> > > > > > > > > (epoch is just an instance of ISOdatetime)
> > > > > > > > >
> > > > > > > > > This gives me a warning:
> > > > > > > > >
> > > > > > > > > Warning message:
> > > > > > > > > number of columns of result
> > > > > > > > >        is not a multiple of vector length (arg 3) in: rbind(1,
> > "12"
> > > > > = c(
> > > > > > > > > 91.99207541277 , 8.00792458723005), "13" = c(90.1884966797708,
> > > > > > > > >
> > > > > > > > > The output from the above command is almost exactly what I
> > need,
> > > > > apart
> > > > > > > from
> > > > > > > > > the recycling:
> > > > > > > > >
> > > > > > > > >      1e-04     2e-04      3e-04        4e-04
> > > > > > > > > 12 91.99208  8.007925 91.9920754  8.007924587 <== recycled
> > values
> > > > > > > > > 13 90.18850  9.337448  0.4218405  0.052214551
> > > > > > > > > 14 90.59640  9.171417  0.2321811 90.596401668
> > > > > > > > > 15 89.55771 10.194291  0.2343418  0.013661453
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > I can just pass this into a barplot() and get a nice visual
> > > > > breakdown of
> > > > > > > > > hourly weighted spreads, *but* I dont know how to get these
> > results
> > > > > > > without
> > > > > > > > > the recycling. Looking at rbind(), it seems that this will
> > > > > automatically
> > > > > > > > > recycle. Does anyone know of a function I could use to get
> > these
> > > > > results
> > > > > > > > > without this problem?
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > > Rory
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 8/1/07, Gabor Grothendieck < ggrothendieck at gmail.com >
> > wrote:
> > > > > > > > > >
> > > > > > > > > > Something similar was just discussed this morning:
> > > > > > > > > >
> > > > > > >
> > > > >
> > https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 8/1/07, Rory Winston <rory.winston at gmail.com > wrote:
> > > > > > > > > > > Hi all
> > > > > > > > > > >
> > > > > > > > > > > I have a question about aggegating statistics by time
> > intervals.
> > > > > I
> > > > > > > have
> > > > > > > > > > a
> > > > > > > > > > > data set with 3 columns : time, bid, and ask. Time is
> > specified
> > > > > as a
> > > > > > > > > > > millisecond timestamp since epoch. I would like to compute
> > > > > summary
> > > > > > > > > > > statistics for the data set on an hourly basis. Here is
> > what I
> > > > > have
> > > > > > > > > > tried so
> > > > > > > > > > > far:
> > > > > > > > > > >
> > > > > > > > > > > # Data is in pricedata
> > > > > > > > > > >
> > > > > > > > > > > t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
> > > > > > > > > > > agg <- aggregate(pricedata$spread, list(byhour=format(t,
> > "%Y-%m
> > > > > > > %H")),
> > > > > > > > > > mean)
> > > > > > > > > > >
> > > > > > > > > > > This seems to do what I want - however, what really want
> > to do
> > > > > is
> > > > > > > more
> > > > > > > > > > > specific: I would like to be able to extract a subset of
> > the
> > > > > data
> > > > > > > frame
> > > > > > > > > > > pricedata, and not just the aggregated entries - for
> > instance,
> > > > > > > instead
> > > > > > > > > > of
> > > > > > > > > > > just extracting pricedata$spread by hour, I would like to
> > > > > extract a
> > > > > > > > > > slice of
> > > > > > > > > > > columns, e.g. pricedata$spread and pricedata$time on an
> > hourly
> > > > > > > basis,
> > > > > > > > > > and
> > > > > > > > > > > pass these into a function that can compute a
> > time-weighted
> > > > > average
> > > > > > > > > > spread,
> > > > > > > > > > > for instance. Does anyone know an elegant way to do this?
> > I have
> > > > > a
> > > > > > > > > > feeling
> > > > > > > > > > > zoo may do what I want, but I'm new to zoo ...
> > > > > > > > > > >
> > > > > > > > > > > Cheers
> > > > > > > > > > > Rory
> > > > > > > > > > >
> > > > > > > > > > >        [[alternative HTML version deleted]]
> > > > > > > > > > >
> > > > > > > > > > >
> > _______________________________________________
> > > > > > > > > > > R-SIG-Finance at stat.math.ethz.ch mailing list
> > > > > > > > > > >
> > > > > https://stat.ethz.ch/mailman/listinfo/r-sig-finance
> > > > > > > > > > > -- Subscriber-posting only.
> > > > > > > > > > > -- If you want to post, subscribe first.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >        [[alternative HTML version deleted]]
> > > > > > > > >
> > > > > > > > >
> > _______________________________________________
> > > > > > > > > R-SIG-Finance at stat.math.ethz.ch mailing list
> > > > > > > > >
> > https://stat.ethz.ch/mailman/listinfo/r-sig-finance
> > > > > > > > > -- Subscriber-posting only.
> > > > > > > > > -- If you want to post, subscribe first.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> >
>