[R] speeding up applying hist() over rows of a matrix

William Dunlap wdunlap at tibco.com
Fri May 2 18:23:12 CEST 2014


Your original code, as a function of 'm' and 'bins', is
f0 <- function (m, bins) {
    t(apply(m, 1, function(x) hist(x, breaks = bins, plot = FALSE)$counts))
}
and it takes about 5 seconds to run on your m1 on my machine:
> system.time(r0 <- f0(m1,bins))
   user  system elapsed
   4.95    0.00    5.02


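hist(x,breaks=bins)$counts is essentially tabulate(cut(x,bins),nbins=length(bins)-1),
as long as every x lies strictly inside the range of bins (hist() also counts
values equal to bins[1] and stops if any value falls outside the breaks, while
cut() turns both into NA).
See whether replacing hist() with tabulate(cut()) speeds things up: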
f1 <- function (m, bins)
{
    nbins <- length(bins) - 1L
    t(apply(m, 1, function(x) tabulate(cut(x, bins), nbins = nbins)))
}
That doesn't help with the time, but it does give the same output
> system.time(r1 <- f1(m1,bins))
   user  system elapsed
   4.85    0.10    5.35
> identical(r0, r1)
[1] TRUE
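
The equivalence between hist() and tabulate(cut()) can be checked on a small vector. The sketch below is only an illustration: the vector x and the breaks are made up, and x deliberately contains the lowest break so that the one difference in defaults shows up (hist() uses include.lowest = TRUE, cut() uses include.lowest = FALSE).

```r
# Made-up data; -3 equals the lowest break on purpose
x <- c(-3, -1.5, 0, 0.2, 2.9, 3)
bins <- seq(-3, 3, 1)
nbins <- length(bins) - 1L

# Counts from hist(), which includes the lowest break by default
h <- hist(x, breaks = bins, plot = FALSE)$counts

# cut() with its defaults maps x == bins[1] to NA, and tabulate() skips NA
tc_default <- tabulate(cut(x, bins), nbins = nbins)

# Asking cut() to include the lowest break reproduces hist()'s counts
tc_lowest <- tabulate(cut(x, bins, include.lowest = TRUE), nbins = nbins)

stopifnot(identical(h, tc_lowest))
stopifnot(sum(tc_default) == sum(h) - 1L)
```

For data like m1, where no value sits on bins[1], the two forms agree without include.lowest = TRUE, which is why identical(r0, r1) is TRUE above.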

Now try speeding it up by calling cut() on the whole matrix first
and then applying tabulate to each row, as in
f2 <- function (m, bins)  {
    nbins <- length(bins) - 1L
    m <- array(as.integer(cut(m, bins)), dim = dim(m))
    t(apply(m, 1, tabulate, nbins = nbins))
}
That saves quite a bit of time (about 0.25 s instead of 5 s, roughly a
20-fold speedup) and gives the same output
> system.time(r2 <- f2(m1,bins))
   user  system elapsed
   0.25    0.00    0.25
> identical(r0, r2)
[1] TRUE
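
One further variation, not part of the timings above and not timed here, is to drop apply() altogether and make a single tabulate() call over combined (row, bin) indices. This is a sketch under assumptions: the name f3 is hypothetical, the matrix m is a small stand-in for m1, and f2 is repeated only so the block runs on its own.

```r
# Small stand-in for the m1 and bins of the original post
set.seed(42)
m <- matrix(10 * rnorm(200 * 50), ncol = 50)
bins <- seq(-100, 100, 1)

# f2 as given above, repeated so this block is self-contained
f2 <- function(m, bins) {
    nbins <- length(bins) - 1L
    m <- array(as.integer(cut(m, bins)), dim = dim(m))
    t(apply(m, 1, tabulate, nbins = nbins))
}

# f3 (hypothetical name): one tabulate() over combined (row, bin) indices
f3 <- function(m, bins) {
    nbins <- length(bins) - 1L
    nr <- nrow(m)
    b <- as.integer(cut(m, bins))   # bin index per element, column-major order
    ok <- !is.na(b)                 # drop values that fall outside the breaks
    # cell (i, j) of an nr-by-nbins matrix sits at position i + nr * (j - 1)
    idx <- row(m)[ok] + nr * (b[ok] - 1L)
    matrix(tabulate(idx, nbins = nr * nbins), nrow = nr, ncol = nbins)
}

r2 <- f2(m, bins)
r3 <- f3(m, bins)
stopifnot(identical(r2, r3))
```

Whether this beats f2 would need to be measured with system.time() on the real m1; the point is only that the per-row loop can be folded into the index arithmetic.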

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, May 1, 2014 at 12:48 PM, Ortiz-Bobea, Ariel <Ortiz-Bobea at rff.org> wrote:
> Hello everyone,
>
> I'm trying to construct bins for each row in a matrix. I'm using apply() in combination with hist() to do this. Performing this binning for a 10K-by-50 matrix takes about 5 seconds, but only 0.5 seconds for a 1K-by-500 matrix. This suggests the bottleneck is accessing rows in apply() rather than the calculations going on inside hist().
>
> My initial idea is to process as many columns (as make sense for the intended use) at once. However, I still have many many rows to process and I would appreciate any feedback on how to speed this up.
>
> Any thoughts?
>
> Thanks,
>
> Ariel
>
> Here is the illustration:
>
> # create data
> m1 <- matrix(10*rnorm(50*10^4), ncol=50)
> m2 <- matrix(10*rnorm(50*10^4), ncol=500)
>
> # compute bins
> bins <- seq(-100,100,1)
> system.time({ out1 <- t(apply(m1,1, function(x) hist(x,breaks=bins, plot=FALSE)$counts)) })
> system.time({ out2 <- t(apply(m2,1, function(x) hist(x,breaks=bins, plot=FALSE)$counts)) })
>
> ---
> Ariel Ortiz-Bobea
> Fellow
> Resources for the Future
> 1616 P Street, N.W.
> Washington, DC 20036