[R] How to speed up grouping time series, help please

Mon Apr 4 17:20:50 CEST 2011

I did some tests on Your and Gabor solutions, below my findings:

- Your solution is fast as my solution in xts (below) but MUCH MORE
READABLE, in particular I think your test should take into account xts
creation from the data.frame (see below);
- Gabor's solution with read.zoo is fast as xts but give an xts object
that has some problems with time zones.

Any better idea to speed up grouping time series?

Thanks!

Below few line of codes to test (I suggest to grow X size to get
better comparison results):
xtsSplit <- function(x)
{
  x <- xts(x[,c("ID","VALUE")], as.POSIXct(x[,"DATE"]))
  x <- do.call(merge, split(x$VALUE,x$ID))
  return(x)
}

xtsSplitTime <- replicate(100,
  system.time(xtsSplit(X))[[1]])
median(xtsTime)

zooReadTime <- replicate(100,
 system.time(z <- read.zoo(X, split = 1, index = 2, tz = ""))[[1]])
median(zooReadTime)

And my (unreadable) solution:
library(xts)
buildXtsFromDataFrame <- function(x, env)
{
  {
    if(exists("xx", envir = env))
    {
      VALUE <- as.matrix(x$VALUE)
      colnames(VALUE) <- as.character(x$ID[1])
      assign("xx",
        cbind(get("xx", env), xts(VALUE,
          as.POSIXct(x$DATE, tz = "GMT",
            format = '%Y-%m-%d %H:%M:%S'),
          tzone = 'GMT')),
        envir = env)
    } else
    {
      VALUE <- as.matrix(x$VALUE)
      colnames(VALUE) <- as.character(x$ID[1])
      assign("xx",
        xts(VALUE, as.POSIXct(x$DATE, tz = "GMT",
            format = '%Y-%m-%d %H:%M:%S'),
          tzone = 'GMT'),
        envir = env)
    }
    return(TRUE)
  }
}

xtsDaply <- function(...)
{
  e1 <- new.env(parent = baseenv())
  res <- daply(X, "ID", buildXtsFromDataFrame,
      env = e1)
  return(get("xx", e1))
}

Time04 <- replicate(100,
  system.time(xtsDaply(X, X$ID))[[1]])

2011/4/4 Joshua Ulrich <josh.m.ulrich at gmail.com>:
> Hi Dan,
>
> On Mon, Apr 4, 2011 at 7:49 AM, Den Alpin <den.alpin at gmail.com> wrote:
>> I retrieve for a few hundred times a group of time series (10-15 ts
>> with 10000 values each), on every group I do some calculation, graphs
>> etc. I wonder if there is a faster method than what presented below to
>> get an appropriate timeseries object.
>>
>> Making a query with RODBC for every group I get a data frame like this:
>>
>>> X
>>   ID                DATE     VALUE
>> 14  3 2000-01-01 00:00:03 0.5726334
>> 4   1 2000-01-01 00:00:03 0.8830174
>> 1   1 2000-01-01 00:00:00 0.2875775
>> 15  3 2000-01-01 00:00:04 0.1029247
>> 11  3 2000-01-01 00:00:00 0.9568333
>> 9   2 2000-01-01 00:00:03 0.5514350
>> 7   2 2000-01-01 00:00:01 0.5281055
>> 6   2 2000-01-01 00:00:00 0.0455565
>> 12  3 2000-01-01 00:00:01 0.4533342
>> 8   2 2000-01-01 00:00:02 0.8924190
>> 3   1 2000-01-01 00:00:02 0.4089769
>> 13  3 2000-01-01 00:00:02 0.6775706
>>
>> And I want to get a timeSeries object or xts object like this:
>>
>>                            1         2         3
>> 2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
>> 2000-01-01 00:00:01        NA 0.5281055 0.4533342
>> 2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
>> 2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
>> 2000-01-01 00:00:04        NA        NA 0.1029247
>>
>>
>> Input data can be sorted or unsorted (the most complicated case is in
>> the example, unsorted and missing data) in the sense that I can sort
>> in query if I can take an advantage from this.
>>
>> Some considerations:
>> - Xts is generally faster than timeSeries
>> - both accept a matrix so if I can create a matrix like the one
>> represented above and an array of characters representing dates faster
>> than what possible with xts:::cbind, for examole,I will have a faster
>> implementation (package data.table ?).
>> - create timeseries objects in multithread and then merge (package plyr ?)
>> - faster merge algorithms?
>>
>> Below some code to generate the test case above:
>>
>>
>> set.seed(123)
>> N <- 5 # number of observations
>> K <- 3 # number of timeseries ID
>>
>> X <- data.frame(
>>  ID = rep(1:K, each = N),
>>  DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),
>>  VALUE = runif(N*K), stringsAsFactors = FALSE)
>>
>> X <- X[sample(1:(N*K), N*K),] # sample observations to get random
>> order (optional)
>> X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),] # 20% missing
>>
>> head(X, 15)
>>
>> # use explicitly environments to avoid '<<-'
>> buildTimeSeriesFromDataFrame <- function(x, env)
>> {
>>  {
>>    if(exists("xx", envir = env)) # if exist variable xx in env cbind
>>      assign("xx",
>>        cbind(get("xx", env), timeSeries(x$VALUE, x$DATE,
>>          format = '%Y-%m-%d %H:%M:%S',
>>          zone = 'GMT', units = as.character(x$ID[1]))),
>>        envir = env)
>>    else  # create xx in env
>>      assign("xx",
>>        timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S',
>>          zone = 'GMT', units = as.character(x$ID[1])),
>>        envir = env)
>>
>>    return(TRUE)
>>  }
>> }
>>
>> # use package plyr, faster than 'by' function
>> tsDaply <- function(...)
>> {
>>  library(plyr)
>>  e1 <- new.env(parent = baseenv()) #create a new env
>>  res <- daply(X, "ID", buildTimeSeriesFromDataFrame,
>>      env = e1)
>>  return(get("xx", e1)) # return xx from env
>> }
>>
>> ##replicate 100 times
>> #Time03 <- replicate(100,
>> #  system.time(tsDaply(X, X$ID))[[1]])
>> #median(Time03)
>>
>> # result
>> tsDaply(X, X$ID)
>>
>>
>> Thanks in advance for any input, best regards,
>> Den
>>
>>
>
> Here's how I would do it with xts:
>
> x <- xts(X[,c("ID","VALUE")], as.POSIXct(X[,"DATE"]))
> do.call(merge, split(x$VALUE,x$ID))
>
> My xts solution compares favorably to your solution:
>> Time03 <- replicate(100,
> +   system.time(tsDaply(X, X$ID))[[1]])
>> median(Time03)
> [1] 0.02
>> xtsTime <- replicate(100,
> +   system.time(do.call(merge, split(x$VALUE,x$ID)))[[1]])
>> median(xtsTime)
> [1] 0
>
> Best,
> --
> Joshua Ulrich  |  FOSS Trading: www.fosstrading.com
>