[R] speed of a vector operation question

Fri Apr 26 22:55:07 CEST 2013

R's findInterval can also take advantage of a sorted x vector.  E.g.,
in R-3.0.0 on the same 8-core Linux box:

> x <- rexp(1e6, 2)
> system.time(for(i in 1:100)tabulate(findInterval(x, c(-Inf, .3, .5, Inf)))[2])
   user  system elapsed
  2.444   0.000   2.446
> xs <- sort(x)
> system.time(for(i in 1:100)tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2])
   user  system elapsed
  1.472   0.000   1.475
> 
> tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2]
[1] 180636
> sum( xs > .3 & xs <= .5 )
[1] 180636
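
For a single count on an already sorted vector, the sorted data can also be
passed as the second argument of findInterval(), so that only the two
endpoints are searched for. The following is a minimal sketch of that idea,
not a benchmark: count_in_interval is a hypothetical helper name, and it
assumes v is sorted in non-decreasing order with no NAs (findInterval() stops
with an error otherwise).

```r
# Hypothetical helper: number of elements of a sorted vector v in (lo, hi].
# For sorted v, findInterval(b, v) is the number of elements of v that are <= b,
# so the difference of two such counts is the number of elements in (lo, hi].
count_in_interval <- function(v, lo, hi) {
  findInterval(hi, v) - findInterval(lo, v)
}

set.seed(1)                      # arbitrary seed, only for reproducibility
xs <- sort(rexp(1e5, 2))

n_search <- count_in_interval(xs, 0.3, 0.5)
n_sum    <- sum(xs > 0.3 & xs <= 0.5)
stopifnot(n_search == n_sum)     # both approaches give the same count
```

Like the tabulate() example above, this counts the half-open interval
(lo, hi]. With tied values sitting exactly on an endpoint, a strict
comparison such as v < c2 would need separate handling.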

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Friday, April 26, 2013 1:33 PM
> To: William Dunlap
> Cc: lcn; Mikhail Umorin; r-help at r-project.org
> Subject: Re: [R] speed of a vector operation question
> 
> A very similar question was asked on StackOverflow (by Mikhail? and then I guess
> the answers there were somehow not satisfactory...)
> 
> 
> http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or-match
> 
> where it turns out that a binary search (implemented in R) on the sorted vector
> is much faster than sum, etc. I guess because it's log N without copying. The
> more complicated condition x > .3 & x < .5 could be satisfied with multiple
> calls to the search.
> 
> Martin
> 
> On 04/26/2013 01:20 PM, William Dunlap wrote:
> >
> >> I think the sum way is the best.
> >
> > On my Linux machine running R-3.0.0 the sum way is slightly faster:
> >    > x <- rexp(1e6, 2)
> >    > system.time(for(i in 1:100)sum(x>.3 & x<.5))
> >       user  system elapsed
> >      4.664   0.340   5.018
> >    > system.time(for(i in 1:100)length(which(x>.3 & x<.5)))
> >       user  system elapsed
> >      5.017   0.160   5.186
> >
> > If you are doing many of these counts on the same dataset you
> > can save time by using functions like cut(), table(), ecdf(), and
> > findInterval().  E.g.,
> >    > system.time(r1 <- vapply(seq(0,1,by=1/128)[-1], function(i)sum(x>(i-1/128) & x<=i), FUN.VALUE=0L))
> >       user  system elapsed
> >      5.332   0.568   5.909
> >    > system.time(r2 <- table(cut(x, seq(0,1,by=1/128))))
> >       user  system elapsed
> >      0.500   0.008   0.511
> >    > all.equal(as.vector(r1), as.vector(r2))
> >    [1] TRUE
> >
> > You should do the timings yourself, as the relative speeds will depend
> > on the version or dialect of the R interpreter and how it was compiled.
> > E.g., with the current development version of 'TIBCO Enterprise Runtime for R'
> > (aka 'TERR') on this same 8-core Linux box the sum way is considerably faster
> > than the length(which) way:
> >    > x <- rexp(1e6, 2)
> >    > system.time(for(i in 1:100)sum(x>.3 & x<.5))
> >       user  system elapsed
> >       1.87    0.03    0.48
> >    > system.time(for(i in 1:100)length(which(x>.3 & x<.5)))
> >       user  system elapsed
> >       3.21    0.04    0.83
> >    > system.time(r1 <- vapply(seq(0,1,by=1/128)[-1], function(i)sum(x>(i-1/128) & x<=i), FUN.VALUE=0L))
> >       user  system elapsed
> >       2.19    0.04    0.56
> >    > system.time(r2 <- table(cut(x, seq(0,1,by=1/128))))
> >       user  system elapsed
> >       0.27    0.01    0.13
> >    > all.equal(as.vector(r1), as.vector(r2))
> >    [1] TRUE
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of lcn
> >> Sent: Friday, April 26, 2013 12:09 PM
> >> To: Mikhail Umorin
> >> Cc: r-help at r-project.org
> >> Subject: Re: [R] speed of a vector operation question
> >>
> >> I think the sum way is the best.
> >>
> >>
> >> On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin <mikeumo at gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
> >>> are sorted (with duplicates) in the vector (v). I am obtaining the length of
> >>> vectors such as (v < c) or (v > c1 & v < c2), where c, c1, c2 are some
> >>> scalar variables. What is the most efficient way to do this?
> >>>
> >>> I am using sum(v < c) since TRUE's are 1's and FALSE's are 0's. This seems
> >>> to me more efficient than length(which(v < c)), but, please, correct me if
> >>> I'm wrong. So, is there anything faster than what I already use?
> >>>
> >>> I'm running R 2.14.2 on Linux kernel 3.4.34.
> >>>
> >>> I appreciate your time,
> >>>
> >>> Mikhail
> >>>          [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> 
> 
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793