# [Rd] Any interest in "merge" and "by" implementations specifically for so

tshort tshort at eprisolutions.com
Sun Jul 30 14:02:19 CEST 2006

```Kevin, starting with your idea of sorting first, you can get some speedups
just using R. Start by comparing the base case that Bill used:

> x <- runif(2e6)
> i <- rep(1:1e6, 2)
> unix.time(res0 <- unlist(lapply(split(x,i), sum)))
 11.00  0.16 11.28    NA    NA

Now, try sorting and using a loop:

> idx <- order(i)
> xs <- x[idx]
> is <- i[idx]
> res <- array(NA, 1e6)
> idx <- which(diff(is) > 0)
> startidx <- c(1, idx+1)
> endidx <- c(idx, length(xs))
> f1 <- function(x, startidx, endidx, FUN = sum)  {
+   for (j in 1:length(res)) {
+     res[j] <- FUN(x[startidx[j]:endidx[j]])
+   }
+   res
+ }
> unix.time(res1 <- f1(xs, startidx, endidx))
 6.86 0.00 7.04   NA   NA

For the case of sum (or averages), you can vectorize this using cumsum as
follows. This won't work for median or max.

> f2 <- function(x, startidx, endidx)  {
+   cum <- cumsum(x)
+   res <- cum[endidx]
+   res[2:length(res)] <- res[2:length(res)] - cum[endidx[1:(length(res) -
1)]]
+   res
+ }
> unix.time(res2 <- f2(xs, startidx, endidx))
 0.20 0.00 0.21   NA   NA

You can also use Luke Tierney's byte compiler
(http://www.stat.uiowa.edu/~luke/R/compiler/) to speed up the loop for
functions where you can't vectorize:

> library(compiler)
> f3 <- cmpfun(f1)
Note: local functions used: FUN
> unix.time(res3 <- f3(xs, startidx, endidx))
 3.84 0.00 3.91   NA   NA

- Tom

--
View this message in context: http://www.nabble.com/Any-interest-in-%22merge%22-and-%22by%22-implementations-specifically-for-sorted-data--tf2009595.html#a5562816
Sent from the R devel forum at Nabble.com.

```