[R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string?

Sun Dec 20 16:04:20 CET 2009

Try this:

> findall("(.).\\1", "ababacababab")
[1]  1  2  3  5  7  8  9 10

> gregexpr("(.)(?=.\\1)", "ababacababab", perl = TRUE)
[[1]]
[1]  1  2  3  5  7  8  9 10
attr(,"match.length")
[1] 1 1 1 1 1 1 1 1
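
The lookahead has to be written with its own parentheses, as (?=...); inside
it the backreference \\1 is available. In "(.)?=.\\1" the ? is only a
quantifier on the group and = is a literal character, so nothing matches.
Since the lookahead consumes no characters, every match is one character long
and overlapping occurrences are all reported, so the total count is simply the
number of match positions. A minimal sketch, assuming the pattern puts
everything after the first character into the lookahead (count_overlapping is
a name made up here, not an existing R function):

```r
# count_overlapping() is a hypothetical helper, not part of base R.
# It assumes 'pattern' wraps everything after the first character in a
# lookahead, so that successive matches are allowed to overlap.
count_overlapping <- function(pattern, text) {
  stopifnot(length(pattern) == 1, length(text) == 1)
  m <- gregexpr(pattern, text, perl = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match at all
  if (m[1] == -1L) 0L else length(m)
}

x <- "ababacababab"
n_same_ends <- count_overlapping("(.)(?=.\\1)", x)  # first and third char equal
n_aba       <- count_overlapping("a(?=ba)", x)      # overlapping "aba"
n_broken    <- count_overlapping("(.)?=.\\1", x)    # no "(?=": "=" is literal
print(c(n_same_ends, n_aba, n_broken))
```

For the string above this should give 8, 4 and 0, in line with the positions
shown by findall() and gregexpr().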

On Sun, Dec 20, 2009 at 9:33 AM, Hans W Borchers
<hwborchers at googlemail.com> wrote:
> Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
>>
>> Try this:
>>
>> > findall("aba", "ababacababab")
>> [1] 1 3 7 9
>> > gregexpr("a(?=ba)", "ababacababab", perl = TRUE)
>> [[1]]
>> [1] 1 3 7 9
>> attr(,"match.length")
>> [1] 1 1 1 1
>>
>> > findall("a.a", "ababacababab")
>> [1] 1 3 5 7 9
>> > gregexpr("a(?=.a)", "ababacababab", perl = TRUE)
>> [[1]]
>> [1] 1 3 5 7 9
>> attr(,"match.length")
>> [1] 1 1 1 1 1
>
>
> Thanks. Somehow I did not realize that the expression inside "(?=...)"
> can itself be a regular expression.
>
> My original problem was to find all three-character matches whose first
> and last characters are the same.  With  findall()  it works like this:
>
>    findall("(.).\\1", "ababacababab")
>    # [1]  1  2  3  5  7  8  9 10
>
> I am still not able to reproduce this with lookahead. Attempts with
>
>    gregexpr("(.)?=.\\1", "ababacababab", perl = TRUE)
>
> do not work as the lookahead expression apparently does not know about
> the captured group from before.
>
> Regards
> Hans Werner
>
> Correction: I meant the '\G' metacharacter in Perl, not a modifier.
>
>
>> On Sun, Dec 20, 2009 at 7:22 AM, Hans W Borchers
>> <hwborchers <at> googlemail.com> wrote:
>> > Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
>> >
>> > [Sorry; Gmane forces me to delete "more quoted text".]
>> >
>> > ----
>> >    findall <- function(apat, atxt) {
>> >      stopifnot(length(apat) == 1, length(atxt) == 1)
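>> >      pos <- integer(0)  # start positions of matches (empty if none)
>> >      i <- 1; n <- nchar(atxt)
>> >      # search the rest of the text, starting at position i
>> >      found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
>> >      while (found > 0) {
>> >        pos <- c(pos, i + found - 1)  # convert to a position in atxt
>> >        i <- i + found                # resume one past the match start
>> >        found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)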
>> >      }
>> >      return(pos)
>> >    }
>> > ----
>> >