[R-SIG-Finance] question on zoo data manipulation

Gabor Grothendieck ggrothendieck at gmail.com
Thu Apr 17 03:17:43 CEST 2008


I thought about this one a bit more and have two
additional solutions.  One uses zoo more intensively
by forming the "time" index out of the merge keys.
This relies on the fact that zoo can use any class
with certain methods, not just the usual time/date
classes.  However, I think that the best wayof thinking
about this problem is from an SQL viewpoint since its
basically just a three way self merge followed by
an aggregation and the entire thing can be done
in a single SQL statement (although it spans several
lines).  Solution 1 is our prior minimally zoo solution,
solution 2 is the much more zoo-ish solution and
solution 3 uses sqldf to implement it in SQL.

DF <- structure(list(Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "XXX", class = "factor"), Date = c(20080320L,
20080320L, 20080320L, 20080320L, 20080320L, 20080218L, 20080218L,
20080218L, 20080218L), BrokerName = structure(c(1L, 1L, 2L, 2L,
3L, 1L, 1L, 2L, 2L), .Label = c("BRK1", "BRK2", "BRK3"), class = "factor"),
    Acc_Yr = c(200806L, 200906L, 200806L, 200906L, 200806L, 200806L,
    200906L, 200806L, 200906L), Measure = c(2.2, 2.5, 2.3, 2.8,
    3.3, 2.2, 2.5, 2.4, 2.8), lag = c(0L, 0L, 0L, 0L, 0L, 1L,
    1L, 1L, 1L)), .Names = c("Ticker", "Date", "BrokerName",
"Acc_Yr", "Measure", "lag"), class = "data.frame", row.names = c(NA,
-9L))


# zoo solution 1

library(zoo)

f <- function(x) {
    br <- intersect(x[x$lag == 0, "BrokerName"], x[x$lag == 1, "BrokerName"])
    sb <- subset(x, BrokerName %in% br)
    ag <- aggregate(sb["Measure"], sb[c(1, 2, 4, 6)], mean)
    transform(tail(ag, -1), Measure =
        coredata(diff(zoo(ag$Measure), arithmetic = FALSE) - 1))
}
do.call("rbind", by(DF, DF[c(1, 4)], f))

# zoo solution 2

library(zoo)

z <- zoo(DF$Measure, apply(DF[c(1, 3, 4, 6)], 1, paste, collapse = ":"))

zl <- zoo(DF$Measure,
    apply(transform(DF[c(1, 3, 4, 6)], lag = lag+1), 1, paste, collapse = ":"))

zm <- merge(z, zl, all = FALSE)

z01 <- zm[sub(":[0-9]*$", ":1", time(zm)) %in% time(zm)]

transform(aggregate(z01, sub(":[^:]*", "", time(z01)), mean), Change = z/zl-1)

# solution 3 - sqldf

library(sqldf)
sqldf("select Ticker, Date__1, Acc_Yr, lag, avg(Measure)/avg(Mprev)-1 Change
    from (select y.*, x.Measure Mprev from DF x, DF y, DF z
    where x.Ticker = y.Ticker and x.Acc_Yr = y.Acc_Yr
        and x.BrokerName = y.BrokerName and x.lag = y.lag - 1
        and x.Ticker = z.Ticker and x.Acc_Yr = z.Acc_Yr
        and x.BrokerName = z.BrokerName and z.lag = 1)
    group by Ticker, Acc_Yr, lag")



On Tue, Apr 15, 2008 at 8:56 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> This does not really use zoo in any significant way since it
> does not become a time series until the last line of f but
> here is a solution that does use zoo in that one last line.
>
> We use by to split up the data frame with a function f.
> In f, br intersects the lag0 and lag1 brokers and then we
> subset x according to those lines having a broker in br.
> We then take the means, convert the series to zoo and
> perform diff.zoo on it.
>
> There are some aspects of the problem that were not defined
> such as whether to use lag 1 to compare lag 3 if there is no
> lag 2 and we did that here but that could be changed by using
> the lag as the time index.
>
> We have also dumped out the data frame, DF, using dput to make
> it easier to reproduce the solution.
>
> DF <- structure(list(Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L), .Label = "XXX", class = "factor"), Date = c(20080320L,
> 20080320L, 20080320L, 20080320L, 20080320L, 20080218L, 20080218L,
> 20080218L, 20080218L), BrokerName = structure(c(1L, 1L, 2L, 2L,
> 3L, 1L, 1L, 2L, 2L), .Label = c("BRK1", "BRK2", "BRK3"), class = "factor"),
>    Acc_Yr = c(200806L, 200906L, 200806L, 200906L, 200806L, 200806L,
>    200906L, 200806L, 200906L), Measure = c(2.2, 2.5, 2.3, 2.8,
>    3.3, 2.2, 2.5, 2.4, 2.8), lag = c(0L, 0L, 0L, 0L, 0L, 1L,
>    1L, 1L, 1L)), .Names = c("Ticker", "Date", "BrokerName",
> "Acc_Yr", "Measure", "lag"), class = "data.frame", row.names = c(NA,
> -9L))
>
> library(zoo)
>
> f <- function(x) {
>    br <- intersect(x[x$lag == 0, "BrokerName"], x[x$lag == 1, "BrokerName"])
>    sb <- subset(x, BrokerName %in% br)
>    ag <- aggregate(sb["Measure"], sb[c(1, 2, 4, 6)], mean)
>    transform(tail(ag, -1), Measure =
>        coredata(diff(zoo(ag$Measure), arithmetic = FALSE) - 1))
> }
> do.call("rbind", by(DF, DF[c(1, 4)], f))
>
>
> On Mon, Apr 14, 2008 at 8:30 AM, Manoj <manojsw at gmail.com> wrote:
>
> > Hi Zoo-experts,
> >      I am working on the data-set below.
> >
> > Ticker  Date    BrokerName      Acc_Yr  Measure lag
> > XXX     20080320        BRK1    200806  2.2     0
> > XXX     20080320        BRK1    200906  2.5     0
> > XXX     20080320        BRK2    200806  2.3     0
> > XXX     20080320        BRK2    200906  2.8     0
> > XXX     20080320        BRK3    200806  3.3     0
> > XXX     20080218        BRK1    200806  2.2     1
> > XXX     20080218        BRK1    200906  2.5     1
> > XXX     20080218        BRK2    200806  2.4     1
> > XXX     20080218        BRK2    200906  2.8     1
> >
> >
> >
> > Using zoo object, Is there a quicker/efficient way of manipulating the
> > data as per following criteria?
> >
> > 1) For any given date/lag - compute mean of column "measure" grouped
> > by different broker & different accounting year?
> >          so the output data-set should look like:
> >
> > Ticker  Date    Mean Measure    Acc_Yr  Lag
> > XXX     20080320        2.6     200806  0
> >
> > 2) For any lag >= 1, calculate returns on  aggregate "measure"
> > constrained on "intersection" of broker-name across lag 0 & lag 1 (so
> > BRK3 should drop out) ?
> >
> > i.e:  the intermediate data-set should look like
> >
> > Ticker  Date    Mean Measure    Acc_Yr  Lag
> > XXX     20080320        2.25    200806  0
> > XXX     20080318        2.3     200806  1
> >
> >
> > Note that for 200806, the mean changes from 2.6 as measured above to
> > 2.25 (since BRK3 is dropped in calculation.  The final data-set should
> > then be:
> >
> > Ticker  Date    Pct_Change      Acc_Yr  Lag
> > XXX     20080218        0.02    200806  1
> >
> > --------------------
> >
> > I can accomplish the results using a combination of tapply &
> > subsetting the data-set for each lag but I thought this kind of
> > data-structure is ideal for zoo manipulation, hence the help request.
> >
> > Thanks in Advance.
> >
> > Manoj
> >
> > _______________________________________________
> > R-SIG-Finance at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-sig-finance
> > -- Subscriber-posting only.
> > -- If you want to post, subscribe first.
> >
>



More information about the R-SIG-Finance mailing list