[R] templated use of aggregate

Matthew Johnson mcooganj at gmail.com
Wed Jun 13 15:38:09 CEST 2012


Sorry, i'll try and put more flesh on the bones.

please note, i changed the data in the example, as fiddling has raised
another question that's best illustrated with a slightly different
data set.

first of all, when i do as you suggest, i obtain the following error:

> PxMat <- aggregate(mm[,-1] ~ mm[,1], data=mm, sum)

Error in aggregate.formula(mm[, -1] ~ mm[, 1], data = mm, sum) :
  'names' attribute [3] must be the same length as the vector [1]

my data.frame is an xts, and it looks like this:

                    px_ym1 vol_ym1
2012-06-01 09:30:00  97.90       9
2012-06-01 09:30:00  97.90      60
2012-06-01 09:30:00  97.90      71
2012-06-01 09:30:00  97.90       5
2012-06-01 09:30:00  97.90       3
2012-06-01 09:30:00  97.90      21
2012-06-01 09:31:00  97.90       5
2012-06-01 09:31:00  97.89     192
2012-06-01 09:31:00  97.89      65
2012-06-01 09:31:00  97.89      73
2012-06-01 09:31:00  97.89       1
2012-06-01 09:31:00  97.89       1
2012-06-01 09:31:00  97.89      39
2012-06-01 09:31:00  97.90      15
2012-06-01 09:31:00  97.90       1
2012-06-01 09:31:00  97.89       1
2012-06-01 09:31:00  97.90      18
2012-06-01 09:31:00  97.89       1
2012-06-01 09:32:00  97.89      33
2012-06-01 09:34:00  97.89       1
2012-06-01 09:34:00  97.89       1

dput(mn) returns:

> dput(mn)
structure(c(97.9, 97.9, 97.9, 97.9, 97.9, 97.9, 97.9, 97.89,
97.89, 97.89, 97.89, 97.89, 97.89, 97.9, 97.9, 97.89, 97.9, 97.89,
97.89, 97.89, 97.89, 9, 60, 71, 5, 3, 21, 5, 192, 65, 73, 1,
1, 39, 15, 1, 1, 18, 1, 33, 1, 1), .indexCLASS = c("POSIXct",
"POSIXt"), .indexTZ = "GMT", class = c("xts", "zoo"), index =
structure(c(1338543000,
1338543000, 1338543000, 1338543000, 1338543000, 1338543000, 1338543060,
1338543060, 1338543060, 1338543060, 1338543060, 1338543060, 1338543060,
1338543060, 1338543060, 1338543060, 1338543060, 1338543060, 1338543120,
1338543240, 1338543240), tzone = "GMT", tclass = c("POSIXct",
"POSIXt")), .Dim = c(21L, 2L), .Dimnames = list(NULL, c("px_ym1",
"vol_ym1")))

as you can see, the xts data.frame xts data.frame that contains dates,
prices and volumes. There is much more data over a long time period,
and i'm interested in various sub-setting and then aggregate
operations.

I would like to split the data by time period and aggregate the data,
such that i obtain a table which reports the volume traded at each
price, for each of the time-period splits that i have chosen.

I have employed the following approach:

PxMat <- aggregate(.~px_ym1, data=mn, sum)


which yields:

  px_ym1 vol_ym1
1  97.89     408
2  97.90     208

and for subsets, i use the following grouping:

>PxMat30 <- aggregate(.~px_ym1, data=mn[.indexmin(mn) == '30'], sum)

Which yields:

  px_ym1 vol_ym1
1   97.9     169

and

> PxMat31 <- aggregate(.~px_ym1, data=mn[.indexmin(mn) == '31'], sum)

which yields:

  px_ym1 vol_ym1
1  97.89     373
2  97.90      39

and so on and so forth for each minute.

when i try and sub-set using general notation, as follows:

PxMat <- aggregate(.~mn[,1], data=mn, sum)

this yields a different form of output:

px_ym1  px_ym1 vol_ym1
1  97.90 1076.79     408
2  97.89  979.00     208

the problem is that i now have the sum of the px_ym1 data (the sum of mn[,1])

hopefully things are now clearer - sorry to have wasted your time up
until now.

assuming that i have now made my situation clear, i am hope you can
help with four specific questions.

1/ My data-sets are HUGE, so speed is an issue - is this the fastest
way to sub-set and aggregate an xts?

2/ is there a way to do this for multiple splits? say a table for each
minute, day, week, or month? the return would potentially be a list
with a table for each day / minute etc showing volume traded at each
price -- but it doesn't have to be a list ...

i am writing a function with loops that would generate a table that
reports volume traded at each price for each case of a specified time
split (say for four tables, one for each minute in the example data,
returned as a list). my solution is slow, it seems like something that
someone would have done better already. is this the case?

3/ is there a way to do the sub-setting with templated variables? i
would like to obtain the table i get with the named aggregate
functions (reproduced above) with multiple data frames, as the column
names will differ from time to time. i cannot figure out how to stop
the command from summing the mn[,1] column when i stop using variable
names.

4/ on a related note, is it possible to apply different functions to
different columns of data? It would be nice, for example, if the table
returned from an aggregate command could be made to be:

px_ym1  count vol_ym1
1  97.90  11     408
2  97.89  10     208

where we have the price traded, the number of trades (a count of
px_ym1 / mn[,1], and the sum of vol_ym1 (mn[,2]).

thanks and best regards

matt johnson

On 13 June 2012 15:06, David Winsemius <dwinsemius at comcast.net> wrote:
>
>
> On Jun 12, 2012, at 11:32 PM, Matthew Johnson wrote:
>
>> Dear R-help,
>>
>> I have an xts data set that i have subset by date.
>>
>> now it contains a date-time-stamp, and two columns (price and volume
>> traded): my objective is to create tables of volume traded at a price - and
>> i've been successfully using aggregate to do so in interactive use.
>>
>> say the data looks as follows:
>>
>>                   px_ym1 vol_ym1
>> 2012-06-01 09:37:00  97.91     437
>> 2012-06-01 09:37:00  97.91      64
>> 2012-06-01 09:37:00  97.91       1
>> 2012-06-01 09:37:00  97.91       5
>> 2012-06-01 09:37:00  97.91       5
>> 2012-06-01 09:37:00  97.92     174
>> 2012-06-01 09:37:00  97.92      64
>> 2012-06-01 09:37:00  97.92     125
>> 2012-06-01 09:37:00  97.92     124
>> 2012-06-01 09:37:00  97.92      64
>> 2012-06-01 09:37:00  97.92     109
>> 2012-06-01 09:37:00  97.92      64
>> 2012-06-01 09:37:00  97.92      19
>> 2012-06-01 09:37:00  97.92      45
>> 2012-06-01 09:37:00  97.92      75
>> 2012-06-01 09:37:00  97.92       3
>> 2012-06-01 09:37:00  97.92      47
>> 2012-06-01 09:37:00  97.91      26
>> 2012-06-01 09:37:00  97.92       4
>> 2012-06-01 09:37:00  97.92       1
>>
>> the the following gives me what i'm looking for:
>>
>>> adf <- aggregate(.~px_ym1, data=mm, sum)
>>
>>
>> which is this table:
>>
>> px_ym1 vol_ym1
>> 1  97.91     538
>> 2  97.92     918
>>
>> however now i'm trying to code it to run automatically, and use of the
>> templated version:
>>
>>> adf <- aggregate(.~mm[,1], data=mm, sum)
>
>
> Did you try:
>
> adf <- aggregate(mm[,-1] ~ mm[,1], data=mm, sum)
> adf
>
> I would have used names:
>
> adf <- aggregate(vol_ym1 ~ px_ym1, data=mm, sum)
> adf
>
>
>
>> yields the following - which contains what i'd like, but is has also summed
>> across the price column (not ideal).
>>
>>  px_ym1  px_ym1 vol_ym1
>> 1  97.91  587.46     538
>> 2  97.92 1370.88     918
>>
>> how do i code this so that i can enter an xts data-frame with arbitrary
>> names and still obtain the table with only the information i desire?
>
>
> That is too far to the vague side of the vague-specific continuum.
>
>
>>
>> on a related point, is there a way to combine the two steps?
>
>
> Er, which two steps would that be?
>
>
>> the function
>> i've written splits by date and then returns a list containing data-frames
>> that report the volume traded at each price on each date
>>
>> - am i re-creating the wheel here? is there canned function that does this?
>>
>> thanks + best regards
>>
>> matt johnson
>
>
>
> David Winsemius, MD
> West Hartford, CT
>



More information about the R-help mailing list