[R] Find number of elements less than some number: Elegant/fast solution needed
Ray Brownrigg
Ray.Brownrigg at ecs.vuw.ac.nz
Fri Apr 15 04:45:27 CEST 2011
Ravi:
Well, that's a challenge!
Here is a solution that is faster than Bill's f2() as long as length(y) << length(x):
f3 <- function(x, y) length(y)*ecdf(y)(x)
Then using your test on my workstation, I get for f2():
> tt
[1] 0.761 1.071 1.329 1.565 1.894 3.865 11.824
and for f3():
> tt
[1] 0.717 0.882 1.010 1.161 1.389 3.080 15.530
I suspect ecdf() is a little more intuitive than findInterval().
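
One caveat, illustrated on the toy data from the original post: ecdf(y)(x)
gives the proportion of y that is <= x, so f3() counts ties as well, and it
matches the strict "<" versions only when there are no ties -- which is
essentially guaranteed for continuous data such as runif()/rnorm():

x <- 1:10
y <- c(4, 5, 7, 9)
f3(x, y)   # 0 0 0 1 2 2 3 3 4 4; the strict "<" answer is 0 0 0 0 1 2 2 3 3 4
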
Hope this helps,
Ray Brownrigg
On Fri, 15 Apr 2011, Ravi Varadhan wrote:
> Bill's code is insanely fast!
>
> f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
>
> n1 <- 1e07
> n2 <- 10^c(1,2,3,4,5,6,7)
> tt <- rep(NA, 7)
> x <- rnorm(n1)
> for (i in seq_along(n2)) {
>   y <- runif(n2[i])
>   tt[i] <- system.time(a1 <- f2(x, y))[3]
> }
>
> > tt
>
> [1] 0.70 0.86 1.03 1.28 1.54 4.99 12.07
>
>
> I would be surprised if this can be beaten even in C/C++.
>
> Ravi.
> ________________________________________
> From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] On Behalf Of Douglas Bates [bates at stat.wisc.edu]
> Sent: Thursday, April 14, 2011 6:22 PM
> To: William Dunlap
> Cc: r-help at r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel
> Subject: Re: [R] Find number of elements less than some number: Elegant/fast solution needed
>
> My colleague Sunduz Keles once mentioned a similar problem to me. She
> had a large sample from a reference distribution and a test sample
> (both real-valued in her case) and she wanted, for each element of the
> test sample, the proportion of the reference sample that was less than
> the element. It's a type of empirical p-value calculation.
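> In the notation of this thread that proportion is just the strict
> less-than count divided by the reference sample size, e.g. with Bill's
> f2() quoted below (a sketch only; test_sample and reference_sample are
> placeholder names):
>
> p_emp <- f2(test_sample, reference_sample) / length(reference_sample)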
>
> I forget the exact sizes of the samples (do you remember, Sunduz?) but
> they were in the tens of thousands or larger. Solutions in R tended
> to involve comparing every element in the test sample to every element
> in the reference sample but, of course, that is unnecessary. If you
> sort both samples then you can start the comparisons for a particular
> element of the test sample at the element of the reference sample
> where the last comparison failed.
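>
> In pure R that single pass looks roughly like this (only a sketch of the
> idea; count_less_sorted is a placeholder name, and a loop like this is
> exactly what benefits from being moved to compiled code):
>
> count_less_sorted <- function(test, ref) {
>     ord <- order(test)
>     ts <- test[ord]                # test sample, ascending
>     rs <- sort(ref)                # reference sample, ascending
>     counts <- integer(length(ts))
>     j <- 0L                        # reference values < current test value
>     for (i in seq_along(ts)) {
>         while (j < length(rs) && rs[j + 1L] < ts[i]) j <- j + 1L
>         counts[i] <- j             # j never moves backwards: one pass in total
>     }
>     res <- integer(length(test))
>     res[ord] <- counts             # restore the original order of 'test'
>     res
> }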
>
> I was able to write a very short C++ function using the Rcpp package
> that provided about a 1000-fold increase in speed relative to the best
> I could do in R. I don't have the script on this computer so I will
> post it tomorrow when I am back on the computer at the office.
>
> Apologies for cross-posting to the Rcpp-devel list but I am doing so
> because this might make a good example of the usefulness of Rcpp and
> inline.
>
> On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdunlap at tibco.com> wrote:
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Kevin Ummel
> >> Sent: Thursday, April 14, 2011 12:35 PM
> >> To: r-help at r-project.org
> >> Subject: [R] Find number of elements less than some number: Elegant/fast solution needed
> >>
> >> Take a vector x and a subset y:
> >>
> >> x=1:10
> >>
> >> y=c(4,5,7,9)
> >>
> >> For each value in 'x', I want to know how many elements in
> >> 'y' are less than 'x'.
> >>
> >> An example would be:
> >>
> >> sapply(x,FUN=function(i) {length(which(y<i))})
> >>
> >> [1] 0 0 0 0 1 2 2 3 3 4
> >>
> >> But this solution is far too slow when x and y have lengths
> >> in the millions.
> >>
> >> I'm certain an elegant (and computationally efficient)
> >> solution exists, but I'm in the weeds at this point.
> >
> > If y is sorted, findInterval(x, y) counts, for each element of x, how
> > many elements of y are <= it, but you want strict <.  Try
> >
> > f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
> >
> > where your version is
> >
> > f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})
> >
> > E.g.,
> >
> > > x <- sort(sample(1e6, size=1e5))
> > > y <- sample(1e6, size=1e4, replace=TRUE)
> > > system.time(r0 <- f0(x,y))
> >    user  system elapsed
> >   7.900   0.310   8.211
> > > system.time(r2 <- f2(x,y))
> >    user  system elapsed
> >   0.000   0.000   0.004
> > > identical(r0, r2)
> > [1] TRUE
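> >
> > For completeness, why the negation gives the strict "<" count (a small
> > check on the toy data from the original post above, not part of the
> > timings): rev(-sort(y)) is just -y sorted increasingly, so
> > findInterval(-x, rev(-sort(y))) counts the elements of -y that are <= -x,
> > i.e. the elements of y that are >= x, and subtracting that from length(y)
> > leaves the elements of y strictly below x.
> >
> > x <- 1:10; y <- c(4,5,7,9)
> > findInterval(-x, rev(-sort(y)))               # y >= x: 4 4 4 4 3 2 2 1 1 0
> > length(y) - findInterval(-x, rev(-sort(y)))   # y <  x: 0 0 0 0 1 2 2 3 3 4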
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >> Any help is much appreciated.
> >>
> >> Kevin
> >>
> >> University of Manchester
> >>
> >> Take two vectors x and y, where y is a subset of x:
> >>
> >> x=1:10
> >>
> >> y=c(2,5,6,9)
> >>
> >> If y is removed from x, the original x values now have a new
> >> placement (index) in the resulting vector (new):
> >>
> >> new=x[-y]
> >>
> >> index=1:length(new)
> >>
> >> The challenge is: How can I *quickly* and *efficiently*
> >> deduce the new 'index' value directly from the original 'x'
> >> value -- using only 'y' as an input?
> >>
> >> In practice, I have very large matrices containing the 'x'
> >> values, and I need to convert them to the corresponding
> >> 'index' if the 'y' values are removed.
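> >>
> >> (This looks like the same counting problem in disguise: assuming, as in
> >> the example, that the 'x' values are positions 1..n and 'y' holds the
> >> removed positions, a kept value's new index is its old value minus the
> >> number of removed values below it. A sketch:
> >>
> >> x <- 1:10
> >> y <- c(2, 5, 6, 9)
> >> kept <- x[-y]                        # 1 3 4 7 8 10
> >> kept - findInterval(kept, sort(y))   # 1 2 3 4 5 6, i.e. seq_along(kept)
> >>
> >> findInterval() counts the 'y' values <= each kept value, which equals
> >> the strict "<" count here because removed values never remain in 'kept'.)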
> >>