[R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string?
Hans W Borchers
hwborchers at googlemail.com
Sun Dec 20 15:33:42 CET 2009
Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
>
> Try this:
>
> > findall("aba", "ababacababab")
> [1] 1 3 7 9
> > gregexpr("a(?=ba)", "ababacababab", perl = TRUE)
> [[1]]
> [1] 1 3 7 9
> attr(,"match.length")
> [1] 1 1 1 1
>
> > findall("a.a", "ababacababab")
> [1] 1 3 5 7 9
> > gregexpr("a(?=.a)", "ababacababab", perl = TRUE)
> [[1]]
> [1] 1 3 5 7 9
> attr(,"match.length")
> [1] 1 1 1 1 1
Thanks. Somehow I did not realize that the expression in "?=..."
can itself be a regular expression.

My original problem was to find all three-character matches whose first
and last characters are the same. With findall() it works like this:

    findall("(.).\\1", "ababacababab")
    # [1] 1 2 3 5 7 8 9 10

I am still not able to reproduce this with lookahead. Attempts with

    gregexpr("(.)?=.\\1", "ababacababab", perl = TRUE)

do not work as the lookahead expression apparently does not know about
the captured group from before.
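
A possible explanation, offered as a sketch rather than a definitive diagnosis: in the attempt above the "?=" is not enclosed in parentheses, so it is probably not read as a lookahead at all. Instead "(.)?" would be an optional group followed by a literal "=", which never occurs in the text. Assuming the intent was a lookahead that refers back to the first group, the pattern below wraps it as "(?=.\\1)". The variable names txt, m, and pos are only illustrative.

```r
txt <- "ababacababab"

# Capture one character, then assert (without consuming) that any
# character followed by the same captured character comes next.
m <- gregexpr("(.)(?=.\\1)", txt, perl = TRUE)[[1]]

# Drop the attributes to get the plain start positions.
pos <- as.integer(m)
print(pos)

# Number of (overlapping) three-character matches.
print(length(pos))
```

Because the lookahead consumes nothing, each match is one character long and the search resumes at the next character, which is what lets overlapping occurrences be found. Note that when nothing matches, gregexpr() returns -1, so length(pos) would need a check such as sum(pos > 0) to count correctly.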
Regards
Hans Werner
Correction: I meant the '\G' metacharacter in Perl, not a modifier.
> On Sun, Dec 20, 2009 at 7:22 AM, Hans W Borchers
> <hwborchers <at> googlemail.com> wrote:
> > Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
> >
> > [Sorry; Gmane forces me to delete "more quoted text".]
> >
> > ----
> > findall <- function(apat, atxt) {
> >   stopifnot(length(apat) == 1, length(atxt) == 1)
> >   pos <- c()  # positions of matches
> >   i <- 1; n <- nchar(atxt)
> >   found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
> >   while (found > 0) {
> >     pos <- c(pos, i + found - 1)
> >     i <- i + found
> >     found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
> >   }
> >   return(pos)
> > }
> > ----
> >