[R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string?

Hans W Borchers hwborchers at googlemail.com
Sun Dec 20 15:33:42 CET 2009


Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
> 
> Try this:
> 
> > findall("aba", "ababacababab")
> [1] 1 3 7 9
> > gregexpr("a(?=ba)", "ababacababab", perl = TRUE)
> [[1]]
> [1] 1 3 7 9
> attr(,"match.length")
> [1] 1 1 1 1
> 
> > findall("a.a", "ababacababab")
> [1] 1 3 5 7 9
> > gregexpr("a(?=.a)", "ababacababab", perl = TRUE)
> [[1]]
> [1] 1 3 5 7 9
> attr(,"match.length")
> [1] 1 1 1 1 1


Thanks. Somehow I had not realized that the expression inside "(?=...)"
can itself be a regular expression.
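
Since the subject of this thread is the total count of overlapping
occurrences, here is a minimal sketch of how that count could be read off
the gregexpr() result. The helper name count_matches is hypothetical and
only for illustration; it assumes the pattern is already written in the
"first character plus lookahead" form shown above.

```r
# Hypothetical helper: count matches reported by gregexpr().
# gregexpr() returns -1 when there is no match, so that case is handled first.
count_matches <- function(pattern, text) {
  stopifnot(length(pattern) == 1, length(text) == 1)
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m)
}

count_matches("a(?=ba)", "ababacababab")  # 4 (positions 1 3 7 9)
count_matches("a(?=.a)", "ababacababab")  # 5 (positions 1 3 5 7 9)
```

Because only the first character is consumed, the next search starts one
position further on, which is what lets overlapping occurrences be counted.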

My original problem was to find all three-character matches whose first
and last characters are the same.  With  findall()  this works:

    findall("(.).\\1", "ababacababab")
    # [1]  1  2  3  5  7  8  9 10
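
As a cross-check that does not rely on regular expressions at all, the
same positions can be obtained by comparing each character with the one
two places later. This is only a sketch; the function name same_ends is
hypothetical.

```r
# Hypothetical helper: start positions i where character i equals character i + 2.
same_ends <- function(txt) {
  stopifnot(length(txt) == 1)
  chars <- strsplit(txt, "")[[1]]
  n <- length(chars)
  if (n < 3) return(integer(0))   # no three-character window fits
  which(chars[1:(n - 2)] == chars[3:n])
}

same_ends("ababacababab")
# [1]  1  2  3  5  7  8  9 10
```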

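My attempts to reproduce this with lookahead, such as

    gregexpr("(.)?=.\\1", "ababacababab", perl = TRUE)

did not work. Note that this pattern is missing the parentheses around
the lookahead, so "?=" is not read as a lookahead at all: the "?" makes
the group optional and the "=" is matched literally. The captured group
is therefore not the cause of the failure.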
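
For comparison, a sketch of the same idea with the lookahead enclosed in
its own parentheses, "(?=...)". With perl = TRUE the backreference \\1
inside the lookahead refers to the group captured just before it. The
variable names txt, m and pos are only for illustration.

```r
txt <- "ababacababab"

# (.) captures one character; (?=.\\1) then requires, without consuming,
# any character followed by that same captured character.
m <- gregexpr("(.)(?=.\\1)", txt, perl = TRUE)[[1]]

pos <- as.integer(m)  # drop the match.length and capture attributes
pos
# [1]  1  2  3  5  7  8  9 10
```

These are the same positions that  findall("(.).\\1", ...)  reports.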

Regards
Hans Werner

Correction: I meant the '\G' metacharacter in Perl, not a modifier.


> On Sun, Dec 20, 2009 at 7:22 AM, Hans W Borchers
> <hwborchers <at> googlemail.com> wrote:
> > Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
> >
> > [Sorry; Gmane forces me to delete "more quoted text".]
> >
> > ----
> >    findall <- function(apat, atxt) {
> >      stopifnot(length(apat) == 1, length(atxt) == 1)
> >      pos <- c()  # positions of matches
> >      i <- 1; n <- nchar(atxt)
> >      found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
> >      while (found > 0) {
> >        pos <- c(pos, i + found - 1)
> >        i <- i + found
> >        found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
> >      }
> >      return(pos)
> >    }
> > ----
> >



