[R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string?
Hans W Borchers
hwborchers at googlemail.com
Sun Dec 20 13:22:59 CET 2009
Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
>
> Use a zero-width lookaround expression. It will not consume its match. See ?regexp
>
> > gregexpr("a(?=a)", "aaa", perl = TRUE)
> [[1]]
> [1] 1 2
> attr(,"match.length")
> [1] 1 1
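
To turn the positions in the quoted result into a count, one possible sketch is shown here. The variable names are illustrative only. It counts positive positions instead of taking the length of the result, because gregexpr() returns a single -1 when nothing matches.

```r
# Positions where an "a" is immediately followed by another "a".
m <- gregexpr("a(?=a)", "aaa", perl = TRUE)[[1]]

# gregexpr() returns -1 (a vector of length 1) when there is no match,
# so count the positive positions rather than using length(m).
n_overlapping <- sum(m > 0)
n_overlapping   # 2

# A string without any match gives a count of 0, not 1.
none <- gregexpr("a(?=a)", "xyz", perl = TRUE)[[1]]
n_none <- sum(none > 0)
n_none          # 0
```
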
I wonder how you would count the number of occurrences of, for example,
'aba' or 'a.a' (*) in the string "ababacababab" using simple lookahead?
In Perl, there is a modifier '/g' to do that, and in Python one could
apply the function 'findall'.
When I had this task, I wrote a small function findall(), see below, but
I would be glad to see a solution with lookahead only.
Regards
Hans Werner
(*) or anything more complex
----
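findall <- function(apat, atxt) {
  # apat: a single regular expression; atxt: a single string
  stopifnot(length(apat) == 1, length(atxt) == 1)
  pos <- c()  # start positions of matches (NULL if there are none)
  i <- 1; n <- nchar(atxt)
  found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
  while (found > 0) {
    pos <- c(pos, i + found - 1)  # convert to a position in atxt
    i <- i + found                # resume one character past the match start
    found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
  }
  return(pos)
}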
----
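
As for a lookahead-only variant, one possible sketch is to wrap the whole pattern in a zero-width lookahead, so that each match consumes no characters and gregexpr() can report every start position. The helper name find_overlapping() is hypothetical and not part of the original post. It assumes that gregexpr() with perl = TRUE steps forward one character after a zero-length match, which is how current versions of R behave.

```r
# Hypothetical helper: start positions of all matches, overlapping ones
# included, found by wrapping the pattern in a zero-width lookahead.
find_overlapping <- function(pattern, text) {
  stopifnot(length(pattern) == 1, length(text) == 1)
  m <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  pos <- as.integer(m)  # drop the match.length attribute
  pos[pos > 0]          # gregexpr() reports -1 when there is no match
}

find_overlapping("aba", "ababacababab")          # 1 3 7 9
length(find_overlapping("a.a", "ababacababab"))  # 5
length(find_overlapping("aa", "aaa"))            # 2
```

For these inputs the positions should agree with those returned by findall() above. The count is simply the length of the returned vector.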
> On Sun, Dec 20, 2009 at 1:43 AM, Jonathan <jonsleepy <at> gmail.com> wrote:
> > Last one for you guys:
> >
> > The command:
> >
> > length(gregexpr('cus','hocus pocus')[[1]])
> > [1] 2
> >
> > returns the number of times the substring 'cus' appears in 'hocus pocus'
> > (which is two)
> >
> > It's returning the number of **disjoint** matches. So:
> >
> > length(gregexpr('aa','aaa')[[1]])
> > [1] 1
> >
> > returns 1.
> >
> > **What I want to do:**
> > I'm looking for a way to count all occurrences of the substring, including
> > overlapping sets (so 'aa' would be found in 'aaa' two times, because the
> > middle 'a' gets counted twice).
> >
> > Any ideas would be much appreciated!!
> >
> > Signing off and thanks for all the great assistance,
> > Jonathan
>
>