[R] How to speed up grouping time series, help please
Joshua Ulrich
josh.m.ulrich at gmail.com
Mon Apr 4 16:07:58 CEST 2011
Hi Dan,
On Mon, Apr 4, 2011 at 7:49 AM, Den Alpin <den.alpin at gmail.com> wrote:
> I retrieve for a few hundred times a group of time series (10-15 ts
> with 10000 values each), on every group I do some calculation, graphs
> etc. I wonder if there is a faster method than what presented below to
> get an appropriate timeseries object.
>
> Making a query with RODBC for every group I get a data frame like this:
>
>> X
> ID DATE VALUE
> 14 3 2000-01-01 00:00:03 0.5726334
> 4 1 2000-01-01 00:00:03 0.8830174
> 1 1 2000-01-01 00:00:00 0.2875775
> 15 3 2000-01-01 00:00:04 0.1029247
> 11 3 2000-01-01 00:00:00 0.9568333
> 9 2 2000-01-01 00:00:03 0.5514350
> 7 2 2000-01-01 00:00:01 0.5281055
> 6 2 2000-01-01 00:00:00 0.0455565
> 12 3 2000-01-01 00:00:01 0.4533342
> 8 2 2000-01-01 00:00:02 0.8924190
> 3 1 2000-01-01 00:00:02 0.4089769
> 13 3 2000-01-01 00:00:02 0.6775706
>
> And I want to get a timeSeries object or xts object like this:
>
> 1 2 3
> 2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333
> 2000-01-01 00:00:01 NA 0.5281055 0.4533342
> 2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706
> 2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334
> 2000-01-01 00:00:04 NA NA 0.1029247
>
>
> Input data can be sorted or unsorted (the most complicated case is in
> the example, unsorted and missing data) in the sense that I can sort
> in query if I can take an advantage from this.
>
> Some considerations:
> - Xts is generally faster than timeSeries
> - both accept a matrix so if I can create a matrix like the one
> represented above and an array of characters representing dates faster
> than what possible with xts:::cbind, for examole,I will have a faster
> implementation (package data.table ?).
> - create timeseries objects in multithread and then merge (package plyr ?)
> - faster merge algorithms?
>
> Below some code to generate the test case above:
>
>
> set.seed(123)
> N <- 5 # number of observations
> K <- 3 # number of timeseries ID
>
> X <- data.frame(
> ID = rep(1:K, each = N),
> DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)),
> VALUE = runif(N*K), stringsAsFactors = FALSE)
>
> X <- X[sample(1:(N*K), N*K),] # sample observations to get random
> order (optional)
> X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),] # 20% missing
>
> head(X, 15)
>
> # use explicitly environments to avoid '<<-'
> buildTimeSeriesFromDataFrame <- function(x, env)
> {
> {
> if(exists("xx", envir = env)) # if exist variable xx in env cbind
> assign("xx",
> cbind(get("xx", env), timeSeries(x$VALUE, x$DATE,
> format = '%Y-%m-%d %H:%M:%S',
> zone = 'GMT', units = as.character(x$ID[1]))),
> envir = env)
> else # create xx in env
> assign("xx",
> timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S',
> zone = 'GMT', units = as.character(x$ID[1])),
> envir = env)
>
> return(TRUE)
> }
> }
>
> # use package plyr, faster than 'by' function
> tsDaply <- function(...)
> {
> library(plyr)
> e1 <- new.env(parent = baseenv()) #create a new env
> res <- daply(X, "ID", buildTimeSeriesFromDataFrame,
> env = e1)
> return(get("xx", e1)) # return xx from env
> }
>
> ##replicate 100 times
> #Time03 <- replicate(100,
> # system.time(tsDaply(X, X$ID))[[1]])
> #median(Time03)
>
> # result
> tsDaply(X, X$ID)
>
>
> Thanks in advance for any input, best regards,
> Den
>
>
Here's how I would do it with xts:
x <- xts(X[,c("ID","VALUE")], as.POSIXct(X[,"DATE"]))
do.call(merge, split(x$VALUE,x$ID))
My xts solution compares favorably to your solution:
> Time03 <- replicate(100,
+ system.time(tsDaply(X, X$ID))[[1]])
> median(Time03)
[1] 0.02
> xtsTime <- replicate(100,
+ system.time(do.call(merge, split(x$VALUE,x$ID)))[[1]])
> median(xtsTime)
[1] 0
Best,
--
Joshua Ulrich | FOSS Trading: www.fosstrading.com
More information about the R-help
mailing list