[R] aggregate slow with variables of type 'dates' - how to solve
Gabor Grothendieck
ggrothendieck at gmail.com
Sat Apr 16 06:13:46 CEST 2005
On 4/15/05, Christoph Lehmann <christoph.lehmann at gmx.ch> wrote:
> Dear all
> I use aggregate() with variables of type numeric and of type 'dates' (chron).
> For numeric variables, functions such as sum() are very fast, but similarly
> simple functions such as min() are much slower for variables of type 'dates'.
> The difference grows with the number of distinct values in the 'id' variable;
> see this sample code:
>
> dts <- dates(c("02/27/92", "02/27/92", "01/14/92",
> "02/28/92", "02/01/92"))
> ntimes <- 700000
> dts <- data.frame(rep(c(1:40), ntimes/8),
> chron(rep(dts, ntimes), format = c(dates = "m/d/y")),
> rep(c(0.123, 0.245, 0.423, 0.634, 0.256), ntimes))
> names(dts) <- c("id", "date", "tbs")
>
> date()
> dat.1st <- aggregate(dts$date, list(id = dts$id), min)$x
> dat.1st <- chron(dat.1st, format = c(dates = "m/d/y"))
> dat.1st
> date() #82 seconds
>
> date()
> tbs.s <- aggregate(as.numeric(dts$tbs),list(id = dts$id), sum)
> tbs.s
> date() #17 seconds
>
> Is this a problem with the data type 'dates'? If so, is there any way
> to work around it? For huge data sets this can be a problem.
>
> As I mentioned, if the variable 'id' has just 5 levels, the
> two times are roughly the same, but with 40 different ids we get
> this big difference.
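Just convert the dates to numeric first. You are converting them back
to chron afterwards anyway. In the timings below, aggregating the numeric
values takes 0.12 seconds instead of 0.86 and gives an identical result,
presumably because min() then works on plain numbers in each group
rather than going through the methods for the dates class.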
> system.time({
+ dat.1st <- chron(aggregate(dts$date, list(id = dts$id), min)$x)
+ }, TRUE)
[1] 0.86 0.00 0.86 NA NA
> system.time({
+ dat.1st.2 <- chron(aggregate(as.numeric(dts$date), list(id = dts$id), min)$x)
+ }, TRUE)
[1] 0.12 0.00 0.12 NA NA
>
> identical(dat.1st, dat.1st.2)
[1] TRUE
>
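
For readers without chron installed, the same idea can be sketched in base R.
This is only an illustration under stated assumptions: the Date class stands
in for chron's dates class, the data frame df and its contents are made up,
and no speed-up is claimed for Date itself. The point is the pattern of
stripping the class before aggregate() and restoring it afterwards.

```r
# Base R sketch: aggregate dates as plain numbers, then restore the class.
# Assumption: Date stands in for chron's dates class; the data are made up.
d <- as.Date(c("1992-02-27", "1992-02-27", "1992-01-14",
               "1992-02-28", "1992-02-01"))
df <- data.frame(id = rep(1:2, length.out = 10),
                 date = rep(d, 2))

# Direct path: min() is applied to classed Date objects within each group.
res_date <- aggregate(df$date, list(id = df$id), min)

# Numeric path: drop the class, aggregate plain numbers, convert back once.
res_num <- aggregate(as.numeric(df$date), list(id = df$id), min)
first_date <- as.Date(res_num$x, origin = "1970-01-01")

print(first_date)
```

With chron, the final conversion would be chron(res_num$x) as in the
transcript above, rather than as.Date().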