[R] speeding up applying hist() over rows of a matrix

William Dunlap wdunlap at tibco.com
Fri May 2 18:30:43 CEST 2014


And since as.integer(cut(x,bins)) is essentially findInterval(x,bins)
(we throw away the labels made by cut(), and the two differ only for
values that fall exactly on a break or outside the range of bins), I
tried using findInterval() instead of cut(). That cut the time by more
than half again, so your 5.0 s. is now about 0.1 s.
f3 <- function (m, bins)
{
    nbins <- length(bins) - 1L
    m <- array(findInterval(m, bins), dim = dim(m))
    t(apply(m, 1, tabulate, nbins = nbins))
}
> system.time(r3 <- f3(m1,bins))
   user  system elapsed
   0.09    0.00    0.09
> identical(r0,r3)
[1] TRUE
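
One caveat on the findInterval() swap, as a minimal sketch rather than
part of the timings above: cut() and hist() use intervals closed on the
right, (a,b], while findInterval() by default uses intervals closed on
the left, [a,b). The two agree for continuous data like m1 but can
disagree when values fall exactly on a break. The toy breaks b and
vector x below are made up for illustration, and the left.open argument
assumes R >= 3.3.0:

```r
b <- c(0, 1, 2, 3)          # toy breaks (illustrative, not from the thread)
x <- c(0, 0.5, 1, 2, 3)     # several values sit exactly on a break
nb <- length(b) - 1L

# hist() counts with intervals [0,1], (1,2], (2,3]
h <- hist(x, breaks = b, plot = FALSE)$counts

# cut() with include.lowest = TRUE uses the same intervals as hist()
cc <- tabulate(cut(x, b, include.lowest = TRUE), nbins = nb)

# default findInterval() uses [0,1), [1,2), [2,3); x == 3 maps to 4 and
# is silently dropped by tabulate(), so the counts differ here
fdef <- tabulate(findInterval(x, b), nbins = nb)

# left.open = TRUE switches to (a,b]; rightmost.closed = TRUE then closes
# the lowest interval on the left, matching the hist() convention
fopen <- tabulate(findInterval(x, b, left.open = TRUE,
                               rightmost.closed = TRUE), nbins = nb)

print(rbind(hist = h, cut = cc, findInterval = fdef, left.open = fopen))
```

If your data can land exactly on a break (rounded or integer-valued
measurements, say), the left.open form is probably the safer drop-in
for f3.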

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Fri, May 2, 2014 at 9:23 AM, William Dunlap <wdunlap at tibco.com> wrote:
> Your original code, as a function of 'm' and 'bins' is
> f0 <- function (m, bins) {
>     t(apply(m, 1, function(x) hist(x, breaks = bins, plot = FALSE)$counts))
> }
> and the time it takes to run on your m1 is about 5 s. on my machine
>> system.time(r0 <- f0(m1,bins))
>    user  system elapsed
>    4.95    0.00    5.02
>
>
> hist(x,breaks=bins)$counts is essentially tabulate(cut(x,bins),nbins=length(bins)-1).
> See whether replacing hist() with tabulate(cut()) speeds things up:
> f1 <- function (m, bins)
> {
>     nbins <- length(bins) - 1L
>     t(apply(m, 1, function(x) tabulate(cut(x, bins), nbins = nbins)))
> }
> That doesn't help with the time, but it does give the same output
>> system.time(r1 <- f1(m1,bins))
>    user  system elapsed
>    4.85    0.10    5.35
>> identical(r0, r1)
> [1] TRUE
>
> Now try speeding it up by calling cut() on the whole matrix first
> and then applying tabulate to each row, as in
> f2 <- function (m, bins)  {
>     nbins <- length(bins) - 1L
>     m <- array(as.integer(cut(m, bins)), dim = dim(m))
>     t(apply(m, 1, tabulate, nbins = nbins))
> }
> That saves quite a bit of time and gives the same output
>> system.time(r2 <- f2(m1,bins))
>    user  system elapsed
>    0.25    0.00    0.25
>> identical(r0, r2)
> [1] TRUE
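
To make the f2 idea concrete, here is a small self-contained sketch on a
toy matrix (the 4-by-6 matrix m, the seed, and the breaks b are made up
for illustration): bin every element once, restore the matrix shape,
then tabulate each row. The check against the hist()-based version
assumes all values lie strictly inside the range of the breaks.

```r
set.seed(1)                               # arbitrary seed, for reproducibility
m <- matrix(10 * rnorm(24), nrow = 4)     # toy 4-by-6 matrix
b <- seq(-100, 100, by = 10)              # toy breaks, 20 bins
nb <- length(b) - 1L

# one vectorized call bins every element; cut() drops the dim attribute,
# so array() puts the integer bin codes back into matrix shape
codes <- array(as.integer(cut(m, b)), dim = dim(m))

# each row of 'codes' is then tabulated separately
counts <- t(apply(codes, 1, tabulate, nbins = nb))

# reference: one hist() call per row, as in f0
ref <- t(apply(m, 1, function(x) hist(x, breaks = b, plot = FALSE)$counts))

stopifnot(all(counts == ref), all(rowSums(counts) == ncol(m)))
```

The expensive part (interval lookup) runs once over all elements, and
only the cheap tabulate() call is repeated per row.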
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
>
> On Thu, May 1, 2014 at 12:48 PM, Ortiz-Bobea, Ariel <Ortiz-Bobea at rff.org> wrote:
>> Hello everyone,
>>
>> I'm trying to construct bins for each row in a matrix. I'm using apply() in combination with hist() to do this. Performing this binning for a 10K-by-50 matrix takes about 5 seconds, but only 0.5 seconds for a 1K-by-500 matrix. This suggests the bottleneck is accessing rows in apply() rather than the calculations going on inside hist().
>>
>> My initial idea is to process as many columns (as make sense for the intended use) at once. However, I still have many many rows to process and I would appreciate any feedback on how to speed this up.
>>
>> Any thoughts?
>>
>> Thanks,
>>
>> Ariel
>>
>> Here is the illustration:
>>
>> # create data
>> m1 <- matrix(10*rnorm(50*10^4), ncol=50)
>> m2 <- matrix(10*rnorm(50*10^4), ncol=500)
>>
>> # compute bins
>> bins <- seq(-100,100,1)
>> system.time({ out1 <- t(apply(m1,1, function(x) hist(x,breaks=bins, plot=FALSE)$counts)) })
>> system.time({ out2 <- t(apply(m2,1, function(x) hist(x,breaks=bins, plot=FALSE)$counts)) })
>>
>> ---
>> Ariel Ortiz-Bobea
>> Fellow
>> Resources for the Future
>> 1616 P Street, N.W.
>> Washington, DC 20036
>>


