[R-SIG-Finance] Aggregating Statistics By Time Interval

Gabor Grothendieck ggrothendieck at gmail.com
Fri Aug 3 16:29:43 CEST 2007


Try producing it in "long" format using aggregate and then reshaping
it into "wide" format using xtabs, reshape or the reshape package:

twas <- function(x) {
	y <- data.frame(timediff = diff(x$time), head(x, -1))
	aggregate(100 * y[1]/sum(y[1]), y[c("hour", "spread")], sum)
}
tmp2 <- cbind(tmp, hour = fmt(tmp$time))
long <- do.call("rbind", by(tmp2, tmp2["hour"], twas))

# any one of these three:

xtabs(timediff ~., long)

reshape(long, dir = "wide", timevar = "spread", idvar = "hour")

library(reshape)
cast(melt(long, id = 1:2), hour ~ spread)


On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
> Hi
>
> Sorry, I'm not sure what happened with that last one. Here is a fully
> contained example (sorry about the line length if this doesnt wrap).
>
> tmp <- data.frame(
> time=c(1185882786,1185882790,1185882791,1185882791,1185882792,1185882795,1185882796,1185882797,1185882797,1185882798,1185882799,1185882800,1185882806,1185882807,1185882809,1185882810,1185882810,1185882811,1185882845,1185882846,1185882906,1185882918,1185882950,1185882951,1185882951,1185882952,1185882953,1185882954,1185882955,1185882956,1185882991,1185882991,1185882995,1185882996,1185882997,1185882997,1185882998,1185882998,1185882999,1185883003,1185883004,1185883006,1185883007,1185883025,1185883026,1185883086,1185883129,1185883129,1185883133,1185883133,1185883137,1185883137,1185883144,1185883145,1185883145,1185883148,1185883148,1185883149,1185883150,1185883151,1185883152,1185883154,1185883154,1185883155,1185883155,1185883175,1185883176,1185883179,1185883179,1185883180,1185883181,1185883181,1185883182,1185883186,1185883187,1185883191,1185883191,1185883200,1185883200,1185883211,1185883212,1185883214,1185883214,1185883215,1185883217,1185883218,1185883219,1185883279,1185883307,1185883307,1185883365,1185883366,1185883366,1185883367,1185883368,1185883368,1185883368,1185883369,1185883373,1185883376),
> spread=c(1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,1e-04,2e-04,1e-04,2e-04,1e-04,2e-04,3e-04,2e-04)
> )
>
> twas <- function (dat)
> {
>     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> sum)/sum(diff(dat$time)) * 100)
> }
>
> now <- Sys.time()
> epoch <- now - as.numeric(now)
>
> z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
>
> Cheers
> Rory
>
>
> On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
> > I still get no warning.  Please provide complete self contained input
> > and output.
> >
> > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> 1185882791,
> > +  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > +  2e-04, 3e-04))
> > >
> > > twas <-
> > +  function(dat) {
> > +    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > +  sum)/sum(diff(dat$time)) * 100.0)
> > + }
> > > now <- Sys.time()
> > > epoch <- now - as.numeric(now)
> > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > z
> >       1e-04    2e-04
> > 07 66.66667 33.33333
> > >
> > > R.version.string # XP
> > [1] "R version 2.5.1 (2007-06-27)"
> >
> >
> > On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
> > > Hi
> > >
> > > I have figured out what causes the warning (and recycling), but I am not
> > > sure how I can fix it. After seeing that it seemed to work for you, I
> went
> > > back and tried working with different subsets of the data. I eventually
> > > found where it occurs - when we get a third unique spread value. To
> > > reproduce, just change the definition of tmp to be:
> > >
> > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> 1185882791,
> > >  1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > >  2e-04, 3e-04)) <== Added 3e-04
> > >
> > > i.e. I have just changed one of the spread values to be a third value -
> this
> > > seems to trigger the warning  "Warning message:number of columns of
> result
> > > is not a multiple of vector length (arg 3) in: rbind", and the
> recycling. I
> > > tried this on R 2.5.0 and 2.5.1
> > >
> > > Can anyone see what I am doing wrong here?
> > >
> > > Cheers
> > > Rory
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 8/3/07, Gabor Grothendieck < ggrothendieck at gmail.com> wrote:
> > > > Can you provide a reproducible example that exhibits the warning.
> > > > Redoing it in a more easily reproducible way and using the data
> > > > in your post gives me no warning
> > > >
> > > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> > > 1185882791,
> > > > + 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > > + 2e-04, 1e-04))
> > > > >
> > > > > twas <-
> > > > +  function(dat) {
> > > > +     data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > +  sum)/sum(diff(dat$time)) * 100.0)
> > > > + }
> > > > > now <- Sys.time()
> > > > > epoch <- now - as.numeric(now)
> > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > > > z
> > > >       1e-04    2e-04
> > > > 07 66.66667 33.33333
> > > > > R.version.string # XP
> > > > [1] "R version 2.5.1 (2007-06-27)"
> > > >
> > > >
> > > > Here is input:
> > > >
> > > > tmp <- data.frame(time = c(1185882786, 1185882790, 1185882791,
> 1185882791,
> > > > 1185882792, 1185882795), spread = c(1e-04, 1e-04, 2e-04, 1e-04,
> > > > 2e-04, 1e-04))
> > > > twas <-
> > > > function(dat) {
> > > >    data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > sum)/sum(diff(dat$time)) * 100.0)
> > > > }
> > > > now <- Sys.time()
> > > > epoch <- now - as.numeric(now)
> > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > > z
> > > > R.version.string # XP
> > > >
> > > >
> > > >
> > > > On 8/3/07, Rory Winston <rory.winston at gmail.com> wrote:
> > > > > Hi
> > > > >
> > > > > I've been wrestling with this a little bit, using the example in the
> > > email
> > > > > that Gabor pointed me to as a reference, and I think I have almost
> got
> > > what
> > > > > I want...however its still not quite right.
> > > > >
> > > > > I have a variable, tmp, with two dimensions: time and spread:
> > > > >
> > > > > > head(tmp$time)
> > > > > [1] 1185882786 1185882790 1185882791 1185882791 1185882792
> 1185882795
> > > > >
> > > > > > head(tmp$spread)
> > > > > [1] 1e-04 1e-04 2e-04 1e-04 2e-04 1e-04
> > > > > >
> > > > >
> > > > > I also have a function that calculates the time-weighted average
> spread:
> > > > >
> > > > > > twas
> > > > > function(dat) {
> > > > >   data.frame(tapply(diff(dat$time), head(dat$spread, -1),
> > > > > sum)/sum(diff(dat$time)) * 100.0)
> > > > > }
> > > > >
> > > > > I can combine them using as rbind() and by():
> > > > >
> > > > > z <- do.call("rbind", by(tmp, format(epoch + tmp$time, "%H"), twas))
> > > > >
> > > > > (epoch is just an instance of ISOdatetime)
> > > > >
> > > > > This gives me a warning:
> > > > >
> > > > > Warning message:
> > > > > number of columns of result
> > > > >        is not a multiple of vector length (arg 3) in: rbind(1, "12"
> = c(
> > > > > 91.99207541277 , 8.00792458723005), "13" = c(90.1884966797708,
> > > > >
> > > > > The output from the above command is almost exactly what I need,
> apart
> > > from
> > > > > the recycling:
> > > > >
> > > > >      1e-04     2e-04      3e-04        4e-04
> > > > > 12 91.99208  8.007925 91.9920754  8.007924587 <== recycled values
> > > > > 13 90.18850  9.337448  0.4218405  0.052214551
> > > > > 14 90.59640  9.171417  0.2321811 90.596401668
> > > > > 15 89.55771 10.194291  0.2343418  0.013661453
> > > > > ...
> > > > >
> > > > > I can just pass this into a barplot() and get a nice visual
> breakdown of
> > > > > hourly weighted spreads, *but* I dont know how to get these results
> > > without
> > > > > the recycling. Looking at rbind(), it seems that this will
> automatically
> > > > > recycle. Does anyone know of a function I could use to get these
> results
> > > > > without this problem?
> > > > >
> > > > > Cheers
> > > > > Rory
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 8/1/07, Gabor Grothendieck < ggrothendieck at gmail.com > wrote:
> > > > > >
> > > > > > Something similar was just discussed this morning:
> > > > > >
> > >
> https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/137695.html
> > > > > >
> > > > > >
> > > > > > On 8/1/07, Rory Winston <rory.winston at gmail.com > wrote:
> > > > > > > Hi all
> > > > > > >
> > > > > > > I have a question about aggegating statistics by time intervals.
> I
> > > have
> > > > > > a
> > > > > > > data set with 3 columns : time, bid, and ask. Time is specified
> as a
> > > > > > > millisecond timestamp since epoch. I would like to compute
> summary
> > > > > > > statistics for the data set on an hourly basis. Here is what I
> have
> > > > > > tried so
> > > > > > > far:
> > > > > > >
> > > > > > > # Data is in pricedata
> > > > > > >
> > > > > > > t <- ISODatetime(1970, 1, 1, 0, 0, 0) + pricedata$time
> > > > > > > agg <- aggregate(pricedata$spread, list(byhour=format(t, "%Y-%m
> > > %H")),
> > > > > > mean)
> > > > > > >
> > > > > > > This seems to do what I want - however, what really want to do
> is
> > > more
> > > > > > > specific: I would like to be able to extract a subset of the
> data
> > > frame
> > > > > > > pricedata, and not just the aggregated entries - for instance,
> > > instead
> > > > > > of
> > > > > > > just extracting pricedata$spread by hour, I would like to
> extract a
> > > > > > slice of
> > > > > > > columns, e.g. pricedata$spread and pricedata$time on an hourly
> > > basis,
> > > > > > and
> > > > > > > pass these into a function that can compute a time-weighted
> average
> > > > > > spread,
> > > > > > > for instance. Does anyone know an elegant way to do this? I have
> a
> > > > > > feeling
> > > > > > > zoo may do what I want, but I'm new to zoo ...
> > > > > > >
> > > > > > > Cheers
> > > > > > > Rory
> > > > > > >
> > > > > > >        [[alternative HTML version deleted]]
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > R-SIG-Finance at stat.math.ethz.ch mailing list
> > > > > > >
> https://stat.ethz.ch/mailman/listinfo/r-sig-finance
> > > > > > > -- Subscriber-posting only.
> > > > > > > -- If you want to post, subscribe first.
> > > > > > >
> > > > > >
> > > > >
> > > > >        [[alternative HTML version deleted]]
> > > > >
> > > > > _______________________________________________
> > > > > R-SIG-Finance at stat.math.ethz.ch mailing list
> > > > > https://stat.ethz.ch/mailman/listinfo/r-sig-finance
> > > > > -- Subscriber-posting only.
> > > > > -- If you want to post, subscribe first.
> > > > >
> > > >
> > >
> > >
> >
>
>



More information about the R-SIG-Finance mailing list