[R] matching a sequence in a vector?
Petr Savicky
savicky at cs.cas.cz
Thu Feb 16 09:01:49 CET 2012
On Wed, Feb 15, 2012 at 08:12:32PM -0500, Gabor Grothendieck wrote:
> On Tue, Feb 14, 2012 at 11:17 PM, Redding, Matthew
> <Matthew.Redding at deedi.qld.gov.au> wrote:
> > I've been trawling through the documentation and listserv archives on this topic -- but
> > as yet have not found a solution. I'm sure this is pretty simple with R, but I cannot work out how without
> > resorting to ugly nested loops.
> >
> > As far as I can tell, grep, match, and %in% are not the correct tools.
> >
> > Question:
> > given these vectors --
> > patrn <- c(1,2,3,4)
> > exmpl <- c(3,3,4,2,3,1,2,3,4,8,8,23,1,2,3,4,4,34,4,3,2,1,1,2,3,4)
> >
> > how do I get the desired answer by finding the occurence of the pattern and returning the starting indices:
> > 6, 13, 23
> >
>
> Here is a one-liner:
>
> library(zoo)
> which(rollapply(exmpl, 4, identical, patrn, fill = FALSE, align = "left"))
Hi.
There were several solutions in this thread. Their speed differs
quite significantly. Here is a comparison.
patrn <- 1:4
exmpl <- sample(1:4, 10000, replace=TRUE)
occur1 <- function(patrn, exmpl)
{
m <- length(patrn)
n <- length(exmpl)
candidate <- seq.int(length=n-m+1)
for (i in seq.int(length=m)) {
candidate <- candidate[patrn[i] == exmpl[candidate + i - 1]]
}
candidate
}
occur2 <- function(patrn, exmpl)
{
patrn.rev <- rev(patrn)
w <- embed(exmpl,length(patrn))
which(apply(w,1,function(r) all(r == patrn.rev)))
}
occur3 <- function(patrn, exmpl)
{
patrn.rev <- rev(patrn)
w <- embed(exmpl,length(patrn))
which(rowSums(w == rep(rev(patrn), each=nrow(w))) == ncol(w))
}
occur4 <- function(patrn, exmpl)
{
# requires patrn without duplicates
n = length(patrn)
r = rle(diff(match(exmpl, patrn)) == 1L)
cumsum(r$length)[r$values & r$length == (n - 1L)] - (n - 2L)
}
occur5 <- function(patrn, exmpl)
{
which( sapply( 1:(length(exmpl)-length(patrn)+1), function(i) isTRUE( all.equal( patrn, exmpl[i + 0:(length(patrn)-1) ] ) ) ) )
}
occur6 <- function(patrn, exmpl)
{
indx <- embed(rev(seq_along(exmpl)), length(patrn))
matches <- apply(indx, 1, function(.indx){
all(exmpl[.indx] == patrn)
})
rev(indx[matches, 1L])
}
occur7 <- function(patrn, exmpl)
{
which(rollapply(exmpl, length(patrn), identical, patrn, fill = FALSE, align = "left"))
}
library(zoo)
t1 <- system.time( out1 <- occur1(patrn, exmpl) )
t2 <- system.time( out2 <- occur2(patrn, exmpl) )
t3 <- system.time( out3 <- occur3(patrn, exmpl) )
t4 <- system.time( out4 <- occur4(patrn, exmpl) )
t5 <- system.time( out5 <- occur5(patrn, exmpl) )
t6 <- system.time( out6 <- occur6(patrn, exmpl) )
t7 <- system.time( out7 <- occur7(patrn, exmpl) )
print(identical(out1, out2))
print(identical(out1, out3))
print(identical(out1, out4))
print(identical(out1, out5))
print(identical(out1, out6))
print(identical(out1, out7))
print(rbind(t1, t2, t3, t4, t5, t6, t7))
The output was
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
user.self sys.self elapsed user.child sys.child
t1 0.001 0 0.001 0 0
t2 0.062 0 0.061 0 0
t3 0.002 0 0.002 0 0
t4 0.001 0 0.001 0 0
t5 1.749 0 1.749 0 0
t6 0.068 0 0.068 0 0
t7 0.172 0 0.172 0 0
Petr Savicky.
More information about the R-help
mailing list