[R] aggregate slow with variables of type 'dates' - how to solve
Christoph Lehmann
christoph.lehmann at gmx.ch
Sat Apr 16 01:22:34 CEST 2005
Dear all
I use aggregate() on variables of type numeric and of type 'dates' (from the
chron package). For numeric variables, functions such as sum() are very fast,
but similarly simple functions such as min() are much slower on variables of
type 'dates'. The difference grows with the number of distinct values of the
'id' variable; see this sample code:
library(chron)  # provides dates() and chron()

dts <- dates(c("02/27/92", "02/27/92", "01/14/92",
               "02/28/92", "02/01/92"))
ntimes <- 700000
# 3,500,000 rows: 40 distinct ids, 5 distinct dates, 5 distinct tbs values
dts <- data.frame(rep(c(1:40), ntimes/8),
                  chron(rep(dts, ntimes), format = c(dates = "m/d/y")),
                  rep(c(0.123, 0.245, 0.423, 0.634, 0.256), ntimes))
names(dts) <- c("id", "date", "tbs")
# earliest date per id (slow case)
date()
dat.1st <- aggregate(dts$date, list(id = dts$id), min)$x
dat.1st <- chron(dat.1st, format = c(dates = "m/d/y"))
dat.1st
date()  # 82 seconds

# sum of tbs per id (fast case)
date()
tbs.s <- aggregate(as.numeric(dts$tbs), list(id = dts$id), sum)
tbs.s
date()  # 17 seconds
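
The timings above are read off by hand from two date() calls. A minimal sketch of the same kind of measurement with base R's system.time(), which reports elapsed seconds directly, might look like this. The data frame d and its size are made up for illustration so the example runs quickly; they are not the data from this post.

```r
# Hypothetical small data set (not the 3.5 million rows used above).
n <- 20000
d <- data.frame(id = rep(1:40, n / 40), tbs = runif(n))

# system.time() evaluates the expression and returns user, system
# and elapsed times in seconds.
timing <- system.time(
  tbs_sum <- aggregate(d$tbs, list(id = d$id), sum)
)

timing["elapsed"]
```

Wrapping each aggregate() call this way should make the two cases easier to compare than subtracting wall-clock timestamps.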
Is this a problem with the data type 'dates'? If so, is there a way around
it? For huge data sets this can be a real problem.
As mentioned, if the variable 'id' has just 5 levels, the two timings are
roughly the same, but with 40 different ids we get this big difference.
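
One workaround that may be worth timing, offered only as a sketch: a 'dates' object stores days as plain numbers, so the grouping can be done on as.numeric() of the column and the class restored once at the end. This assumes the slowdown comes from min() being dispatched on the 'dates' class for every group, which has not been verified here. The object names (small, first_num, first_date) and the reduced size are hypothetical.

```r
library(chron)

# Small stand-in for the data frame in this post (hypothetical size).
base_dates <- dates(c("02/27/92", "02/27/92", "01/14/92",
                      "02/28/92", "02/01/92"))
n <- 1000
small <- data.frame(id = rep(1:40, n / 8),
                    date = chron(rep(base_dates, n),
                                 format = c(dates = "m/d/y")))

# Take the minimum on the underlying numeric day counts, so min()
# works on plain doubles within each id group.
first_num <- tapply(as.numeric(small$date), small$id, min)

# Restore the chron class once, on the 40 per-id results only.
first_date <- chron(as.vector(first_num), format = c(dates = "m/d/y"))
first_date
```

Whether this actually closes the gap on the full 3.5 million rows would need to be measured.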
thanks a lot
Christoph