[Rd] Bug report: POSIX regular expression doesn't match for somewhat higher values of upper bound

Martin Maechler maechler at stat.math.ethz.ch
Wed Apr 5 11:15:20 CEST 2017


>>>>>   <dietmar.schindler at manroland-web.com>
>>>>>     on Tue, 4 Apr 2017 08:45:30 +0000 writes:

    > Dear Sirs,
    > while

    >> regexpr('(.{1,2})\\1', 'foo')
    > [1] 2
    > attr(,"match.length")
    > [1] 2
    > attr(,"useBytes")
    > [1] TRUE

    > yields the correct match, an incremented upper bound in

    >> regexpr('(.{1,3})\\1', 'foo')
    > [1] -1
    > attr(,"match.length")
    > [1] -1
    > attr(,"useBytes")
    > [1] TRUE

    > incorrectly yields no match.

Hmm, yes, I would also say that this is incorrect
(though I'm always cautious: The  ?regex  help page explicitly
 mentions greedy repetitions, and these can "bite you" ..)

The behavior also differs from that with  perl=TRUE,  which is
correct (according to the above understanding).
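
A minimal sketch comparing the two engines on the original example.
The variable names are only for illustration.  The result of the
default engine is the one reported above and may differ in other R
versions, so only the  perl = TRUE  result is annotated:

```r
pat <- '(.{1,3})\\1'
m_default <- regexpr(pat, 'foo')               # default (TRE) engine
m_perl    <- regexpr(pat, 'foo', perl = TRUE)  # PCRE
## start positions side by side (-1 means "no match")
c(default = as.integer(m_default), perl = as.integer(m_perl))
## with perl = TRUE the match is "oo": start 2, match.length 2
regmatches('foo', m_perl)
```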

Using  grep() instead of regexpr() makes the behavior easier to parse.
The following code 
----------------------------------------------------------------------

tx <- c("ab","abc", paste0("foo", c("", "b", "o", "bar", "oofy")))
setNames(nchar(tx), tx)
## ab     abc     foo    foob    fooo  foobar foooofy
##  2       3       3       4       4       6       7

grep1r <- function(n, txt, ...) {
    pattern <- paste0('(.{1,',n,'})\\1', collapse="") ## can have empty n
    ans <- grep(pattern, txt, value=TRUE, ...)
    cat(sprintf("pattern '%s' : ", pattern)); print(ans, quote=FALSE)
    invisible(ans)
}

grep1r({}, tx)# '.{1,}' : because of _greedy_ matching there is __no__ repetition!
grep1r(100,tx)# i.e., these both give an empty match :  character(0)

## matching at most once:
grep1r(1, tx)# matches all 5 starting with "foo"
grep1r(2, tx)# ditto    : all have more than 2 chars
grep1r(3, tx)# not "foo": those with more than 3 chars
grep1r(4, tx)# .. those with more than 4 characters
grep1r(5, tx)# .. those with more than 5 characters
grep1r(6, tx)# .. those with more than 6 characters
grep1r(7, tx)# NONE (= those with more than 7 characters)

for(p in c(FALSE,TRUE)) {
    cat("\ngrep(*, perl =", p, ") :\n")
    for(n in c(list(NULL), 1:7))
        grep1r(n, tx, perl = p)
}

----------------------------------------------------------------------

ends with

> for(p in c(FALSE,TRUE)) {
+     cat("\ngrep(*, perl =", p, ") :\n")
+     for(n in c(list(NULL), 1:7))
+         grep1r(n, tx, perl = p)
+ }

grep(*, perl = FALSE ) :
pattern '(.{1,})\1' : character(0)
pattern '(.{1,1})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,2})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,3})\1' : [1] foob    fooo    foobar  foooofy
pattern '(.{1,4})\1' : [1] foobar  foooofy
pattern '(.{1,5})\1' : [1] foobar  foooofy
pattern '(.{1,6})\1' : [1] foooofy
pattern '(.{1,7})\1' : character(0)

grep(*, perl = TRUE ) :
pattern '(.{1,})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,1})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,2})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,3})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,4})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,5})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,6})\1' : [1] foo     foob    fooo    foobar  foooofy
pattern '(.{1,7})\1' : [1] foo     foob    fooo    foobar  foooofy
>
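
To see which answer one would expect independently of any regex
engine, here is a brute-force sketch.  has_adjacent_repeat() is a
hypothetical helper written for this message, not part of base R; it
assumes that "correct" means: some substring of 1 to n characters is
immediately followed by a copy of itself.

```r
## Hypothetical helper: TRUE if `s` contains a block of 1..n characters
## that is immediately followed by an identical block.
has_adjacent_repeat <- function(s, n) {
    len <- nchar(s)
    for (k in seq_len(min(n, len %/% 2L))) {      # candidate block length
        for (i in seq_len(len - 2L * k + 1L)) {   # candidate start position
            if (substr(s, i, i + k - 1L) == substr(s, i + k, i + 2L * k - 1L))
                return(TRUE)
        }
    }
    FALSE
}

tx <- c("ab","abc", paste0("foo", c("", "b", "o", "bar", "oofy")))
for (n in 1:7) {
    hit <- vapply(tx, has_adjacent_repeat, logical(1), n = n, USE.NAMES = FALSE)
    cat(sprintf("n = %d : ", n)); print(tx[hit], quote = FALSE)
}
```

Under that assumption every string containing "oo" qualifies for all
n >= 1, which agrees with the  perl = TRUE  results above.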


