[R] General binary search?
Martin Morgan
mtmorgan at fhcrc.org
Wed Apr 6 18:59:25 CEST 2011
On 04/04/2011 01:50 PM, William Dunlap wrote:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Stavros Macrakis
>> Sent: Monday, April 04, 2011 1:15 PM
>> To: r-help
>> Subject: [R] General binary search?
>>
>> Is there a generic binary search routine in a standard library which
>>
>> a) works for character vectors
>> b) runs in O(log(N)) time?
>>
>> I'm aware of findInterval(x,vec), but it is restricted to
>> numeric vectors.
>
> xtfrm(x) will convert a character (or other) vector to
> a numeric vector with the same ordering. findInterval
> can work on that. E.g.,
> > f0<- function(x, vec) {
> tmp<- xtfrm(c(x, vec))
> findInterval(tmp[seq_along(x)], tmp[-seq_along(x)])
> }
> > f0(c("Baby", "Aunt", "Dog"), LETTERS)
> [1] 2 1 4
> I've never looked at its speed.
For a little progress (though no 'generic binary searchin a standard
library'), here's the 'one-liner'
bsearch1 <-
function(val, tab, L=1L, H=length(tab))
{
while (H >= L) {
M <- L + (H - L) %/% 2L
if (tab[M] > val) H <- M - 1L
else if (tab[M] < val) L <- M + 1L
else return(M)
}
return(L - 1L)
}
It seems like a good candidate for the new (R-2.13) 'compiler' package, so
library(compiler)
bsearch2 <- cmpfun(bsearch1)
And Bill's suggestion
bsearch3 <- function(val, tab) {
tmp <- xtfrm(c(val, tab))
findInterval(tmp[seq_along(val)], tmp[-seq_along(val)])
}
which will work best when 'val' is a vector to be looked up.
A quick look at data.table:::sortedmatch seemed to return matches,
whereas Stavros is looking for lower bounds.
It seems that one could shift the weight more to C code by 'vectorizing'
the one-liner, first as
bsearch5 <-
function(val, tab, L=1L, H=length(tab))
{
b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
i0 <- seq_along(val)
repeat {
M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
i <- tab[M] > val[i0]
b[i0 + i * length(val)] <-
ifelse(i, M - 1L, ifelse(tab[M] < val[i0], M + 1L, M))
i0 <- which(b[i0, "H"] >= b[i0, "L"])
if (!length(i0)) break;
}
b[,"L"] - 1L
}
and then a little more thoughtfully (though more room for improvement) as
bsearch7 <-
function(val, tab, L=1L, H=length(tab))
{
b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
i0 <- seq_along(val)
repeat {
updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
tabM <- tab[M]
val0 <- val[i0]
i <- tabM < val0
updt[i] <- M[i] + 1L
i <- tabM > val0
updt[i] <- M[i] - 1L
b[i0 + i * length(val)] <- updt
i0 <- which(b[i0, "H"] >= b[i0, "L"])
if (!length(i0)) break;
}
b[,"L"] - 1L
}
none of bsearch 3, 5, or 7 is likely to benefit substantially from
compilation.
Here's a little test data set converting numeric to character as an easy
cheat.
set.seed(123L)
x <- sort(as.character(rnorm(1e6)))
y <- as.character(rnorm(1e4))
There seems to be some significant initial overhead, so we warm things
up (and also introduce the paradigm for multiple look-ups in bsearch 1, 2)
warmup <- function(y, x) {
lapply(y, bsearch1, x)
lapply(y, bsearch2, x)
bsearch3(y, x)
bsearch5(y, x)
bsearch7(y, x)
}
replicate(3, warmup(y, x))
and then time
> system.time(res1 <- unlist(lapply(y, bsearch1, x), use.names=FALSE))
user system elapsed
2.692 0.000 2.696
> system.time(res2 <- unlist(lapply(y, bsearch2, x), use.names=FALSE))
user system elapsed
1.379 0.000 1.380
> identical(res1, res2)
[1] TRUE
> system.time(res3 <- bsearch3(y, x)); identical(res1, res3)
user system elapsed
8.339 0.001 8.350
[1] TRUE
> system.time(res5 <- bsearch5(y, x)); identical(res1, res5)
user system elapsed
0.700 0.000 0.702
[1] TRUE
> system.time(res7 <- bsearch7(y, x)); identical(res1, res7)
user system elapsed
0.222 0.000 0.222
[1] TRUE
Martin
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>>
>> I'm also aware of various hashing solutions (e.g.
>> new.env(hash=TRUE) and
>> fastmatch), but I need the greatest-lower-bound match in my
>> application.
>>
>> findInterval is also slow for large N=length(vec) because of the O(N)
>> checking it does, as Duncan Murdoch has pointed
>> out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>:
>> though
>> its documentation says it runs in O(n * log(N)), it actually
>> runs in O(n *
>> log(N) + N), which is quite noticeable for largish N. But
>> that is easy
>> enough to work around by writing a variant of findInterval which calls
>> find_interv_vec without checking.
>>
>> -s
>>
>> PS Yes, binary search is a one-liner in R, but I always prefer to use
>> standard, fast native libraries when possible....
>>
>> binarysearch<- function(val,tab,L,H) {while (H>=L) {
>> M=L+(H-L) %/% 2; if
>> (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)};
>> return(L-1)}
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the R-help
mailing list