[R] speeding up applying hist() over rows of a matrix
William Dunlap
wdunlap at tibco.com
Fri May 2 18:23:12 CEST 2014
Your original code, as a function of 'm' and 'bins', is
f0 <- function (m, bins) {
  t(apply(m, 1, function(x) hist(x, breaks = bins, plot = FALSE)$counts))
}
and the time it takes to run on your m1 is about 5 seconds on my machine:
> system.time(r0 <- f0(m1,bins))
user system elapsed
4.95 0.00 5.02
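hist(x,breaks=bins,plot=FALSE)$counts is essentially tabulate(cut(x,bins),nbins=length(bins)-1).
(For data inside the range of the breaks, the two differ only for values exactly
equal to the lowest break: hist() counts them by default, cut() turns them into NA.)
See whether replacing hist() with tabulate(cut()) speeds things up:
f1 <- function (m, bins) {
  nbins <- length(bins) - 1L
  t(apply(m, 1, function(x) tabulate(cut(x, bins), nbins = nbins)))
}
That doesn't help with the time, but it does give the same output: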
> system.time(r1 <- f1(m1,bins))
user system elapsed
4.85 0.10 5.35
> identical(r0, r1)
[1] TRUE
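The equivalence between hist() and tabulate(cut()) can also be checked by hand on a small vector. The following is only a sketch: 'breaks4', 'x' and 'x0' are made-up names and toy data, not anything from the original question. The second half shows the one default that differs between the two functions (include.lowest).

```r
# Toy data; 'breaks4', 'x' and 'x0' are illustrative names, not from the thread.
breaks4 <- 0:4
x <- c(0.5, 1.2, 1.8, 2.0, 3.7)

# Counts per bin from hist() and from tabulate(cut()).
h  <- hist(x, breaks = breaks4, plot = FALSE)$counts
tc <- tabulate(cut(x, breaks4), nbins = length(breaks4) - 1L)
print(identical(h, tc))

# A value exactly equal to the lowest break: hist() uses include.lowest = TRUE
# by default, while cut() uses include.lowest = FALSE and returns NA for it.
x0 <- c(0, 0.5)
h0       <- hist(x0, breaks = breaks4, plot = FALSE)$counts
tc0      <- tabulate(cut(x0, breaks4), nbins = length(breaks4) - 1L)
tc0_incl <- tabulate(cut(x0, breaks4, include.lowest = TRUE),
                     nbins = length(breaks4) - 1L)
print(rbind(hist = h0, cut = tc0, cut_incl = tc0_incl))
```

With continuous data such as rnorm() output, an exact hit on the lowest break is very unlikely, which is consistent with identical(r0, r1) being TRUE above.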
Now try speeding it up by calling cut() on the whole matrix first
and then applying tabulate() to each row, as in
f2 <- function (m, bins) {
  nbins <- length(bins) - 1L
  m <- array(as.integer(cut(m, bins)), dim = dim(m))
  t(apply(m, 1, tabulate, nbins = nbins))
}
That saves quite a bit of time and gives the same output:
> system.time(r2 <- f2(m1,bins))
user system elapsed
0.25 0.00 0.25
> identical(r0, r2)
[1] TRUE
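If the remaining apply() call ever matters, one possible further step (a sketch only, not timed here and not part of the original suggestion) is to drop apply() entirely and make a single tabulate() call over a combined (row, bin) index. The name 'f3' and the toy matrix 'm_small' are hypothetical; f2 is repeated only so the block runs on its own.

```r
# f2 as given above: cut() once on the whole matrix, then tabulate() per row.
f2 <- function (m, bins) {
  nbins <- length(bins) - 1L
  m <- array(as.integer(cut(m, bins)), dim = dim(m))
  t(apply(m, 1, tabulate, nbins = nbins))
}

# Hypothetical variant without apply(): encode each (row, bin) pair as one
# integer in column-major order and tabulate all pairs in a single call.
f3 <- function (m, bins) {
  nbins <- length(bins) - 1L
  b  <- as.integer(cut(m, bins))        # bin index per element, NA if outside the bins
  ok <- !is.na(b)                       # drop values that fall in no bin
  idx <- row(m)[ok] + nrow(m) * (b[ok] - 1L)
  matrix(tabulate(idx, nbins = nrow(m) * nbins),
         nrow = nrow(m), ncol = nbins)
}

# Small reproducible check that both functions agree.
set.seed(1)
m_small <- matrix(10 * rnorm(200), ncol = 20)
bins <- seq(-100, 100, 1)
r2 <- f2(m_small, bins)
r3 <- f3(m_small, bins)
print(all(r2 == r3))
```

Whether f3 is actually faster than f2 on m1 would need to be checked with system.time(); note also that the combined index has to fit in an integer, so nrow(m) * nbins must stay below .Machine$integer.max.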
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Thu, May 1, 2014 at 12:48 PM, Ortiz-Bobea, Ariel <Ortiz-Bobea at rff.org> wrote:
> Hello everyone,
> I'm trying to construct bins for each row in a matrix. I'm using apply() in combination with hist() to do this. Performing this binning for a 10K-by-50 matrix takes about 5 seconds, but only 0.5 seconds for a 1K-by-500 matrix. This suggests the bottleneck is accessing rows in apply() rather than the calculations going on inside hist().
> My initial idea is to process as many columns (as make sense for the intended use) at once. However, I still have many many rows to process and I would appreciate any feedback on how to speed this up.
> Any thoughts?
> Thanks,
> Ariel
> Here is the illustration:
> # create data
> m1 <- matrix(10*rnorm(50*10^4), ncol=50)
> m2 <- matrix(10*rnorm(50*10^4), ncol=500)
> # compute bins
> bins <- seq(-100,100,1)
> system.time({ out1 <- t(apply(m1,1, function(x) hist(x,breaks=bins, plot=FALSE)$counts)) })
> system.time({ out2 <- t(apply(m2,1, function(x) hist(x,breaks=bins, plot=FALSE)$counts)) })
> ---
> Ariel Ortiz-Bobea
> Fellow
> Resources for the Future
> 1616 P Street, N.W.
> Washington, DC 20036