# [R] Speeding up "accumulation" code in large matrix calc?

Petr Savicky savicky at cs.cas.cz
Fri Feb 24 21:55:46 CET 2012

```On Fri, Feb 24, 2012 at 09:06:28PM +0100, Berend Hasselman wrote:
[...]
> I've done some speed tests.
>
> t1 <- function(m, outcome) {
> 	nrows <- dim(m)[1]
> 	ncols <- dim(m)[2]
> 	res <- matrix(rep.int(0, nrows*ncols), nrow=nrows)
> 	for(row in 1:nrows) {
> 		for(col in 2:ncols) {
> 			res[row,col] <- if(m[row,col-1]==outcome) {0} else {1+res[row,col-1]}
> 		}
> 	}
> 	res
> }
>
> # by row
> t2 <- function(A, outcome) {
>     oneRow <- function(x, outcome)
>     {
>         n <- length(x)
>         y <- c(0, cumsum(x[-n] == outcome))
>         ave(x, y, FUN = function(z) seq.int(along=z) - 1)
>     }
>
>     t(apply(A, 1, oneRow, outcome=1))
> }
>
> # transformation
> t3 <- function(A, outcome) {
>     B <- array(0, dim=dim(A))
>     curr <- B[, 1]
>     for (i in seq.int(from=2, length=ncol(A)-1)) {
>         curr <- ifelse (A[, i-1] == outcome, 0, 1 + curr)
>         B[, i] <- curr
>     }
>     B
> }
>
> # Compile functions to byte code
> library(compiler)
> t1.c <- cmpfun(t1)
> t2.c <- cmpfun(t2)
> t3.c <- cmpfun(t3)
>
> Nrow <- 100
> Ncol <- 1000
> A <- matrix((runif(Ncol*Nrow)<0.2)+0, nrow=Nrow)
>
> z1 <- system.time(res1 <- t1(A,outcome=1))
> z2 <- system.time(res2 <- t2(A,outcome=1))
> z3 <- system.time(res3 <- t3(A, outcome=1))
> z4 <- system.time(res4 <- t1.c(A,outcome=1))
> z5 <- system.time(res5 <- t2.c(A,outcome=1))
> z6 <- system.time(res6 <- t3.c(A, outcome=1))
> print(all(res2==res1))
> print(all(res3==res1))
> print(all(res4==res1))
> print(all(res5==res1))
> print(all(res6==res1))
> print(rbind(z1,z2,z3,z4,z5,z6))
>
[...]
> ------------------------------------------
>
> Summary of results
>
> > Nrow <- 100
> > Ncol <- 1000
> > A <- matrix((runif(Ncol*Nrow)<0.2)+0, nrow=Nrow)
> > print(rbind(z1,z2,z3,z4,z5,z6))
>    user.self sys.self elapsed user.child sys.child
> z1     0.543    0.005   0.551          0         0
> z2     0.299    0.004   0.304          0         0
> z3     0.070    0.012   0.082          0         0
> z4     0.112    0.002   0.114          0         0
> z5     0.263    0.007   0.271          0         0
> z6     0.062    0.009   0.070          0         0
>
>
> > Nrow <- 1000
> > Ncol <- 100
> > A <- matrix((runif(Ncol*Nrow)<0.2)+0, nrow=Nrow)
> >
> > z1 <- system.time(res1 <- t1(A,outcome=1))
> > z2 <- system.time(res2 <- t2(A,outcome=1))
> > z3 <- system.time(res3 <- t3(A, outcome=1))
> > z4 <- system.time(res4 <- t1.c(A,outcome=1))
> > z5 <- system.time(res5 <- t2.c(A,outcome=1))
> > z6 <- system.time(res6 <- t3.c(A, outcome=1))
>    user.self sys.self elapsed user.child sys.child
> z1     0.543    0.005   0.551          0         0
> z2     0.299    0.004   0.304          0         0
> z3     0.070    0.012   0.082          0         0
> z4     0.112    0.002   0.114          0         0
> z5     0.263    0.007   0.271          0         0
> z6     0.062    0.009   0.070          0         0
[...]
>
> The original function (t1) benefits a lot from compiling to byte code.
> The function that operates on columns (t3) seems to be always the quickest in this experiment.

Thank you for this. Intuition may be wrong.

Function t2() calls ave(), which contains a loop over the groups, so
its efficiency depends also on the number of groups. If the matrix is
generated with fewer 1's, then i obtained

Nrow <- 100
Ncol <- 1000
A <- matrix((runif(Ncol*Nrow)<0.002)+0, nrow=Nrow)

library(rbenchmark)
benchmark(t1(A,outcome=1),
t2(A,outcome=1),
t3(A,outcome=1),
t1.c(A,outcome=1),
t2.c(A,outcome=1),
t3.c(A,outcome=1),
columns=c("test", "user.self", "relative"),
replications=1)

1   t1(A, outcome = 1)     0.776 13.362069
4 t1.c(A, outcome = 1)     0.308  5.275862
2   t2(A, outcome = 1)     0.172  2.965517
5 t2.c(A, outcome = 1)     0.176  3.034483
3   t3(A, outcome = 1)     0.060  1.051724
6 t3.c(A, outcome = 1)     0.056  1.000000

So, the result is slightly better for t2() with probability 0.002
of 1 than with probability 0.2. At least, it is faster than t1.c().
But still, t3() or t3.c() is faster.

Petr.

```