[Rd] Bug report: POSIX regular expression doesn't match for somewhat higher values of upper bound
Martin Maechler
maechler at stat.math.ethz.ch
Wed Apr 5 11:15:20 CEST 2017
>>>>> <dietmar.schindler at manroland-web.com>
>>>>> on Tue, 4 Apr 2017 08:45:30 +0000 writes:
> Dear Sirs,
> while
>> regexpr('(.{1,2})\\1', 'foo')
> [1] 2
> attr(,"match.length")
> [1] 2
> attr(,"useBytes")
> [1] TRUE
> yields the correct match, an incremented upper bound in
>> regexpr('(.{1,3})\\1', 'foo')
> [1] -1
> attr(,"match.length")
> [1] -1
> attr(,"useBytes")
> [1] TRUE
> incorrectly yields no match.
Hmm, yes, I would also say that this is incorrect
(though I'm always cautious here: the ?regex help page explicitly
mentions greedy repetitions, and these can "bite you" ..).
The behavior also differs from the perl=TRUE one, which is
correct (according to the above understanding).
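To make the difference between the two engines concrete, here is a minimal sketch that runs the reported pattern through regexpr() with and without perl=TRUE. The variable names (pat, m_tre, m_pcre) are only illustrative. The default (non-perl) result depends on the regex library in your R build, so the code prints it without assuming a value; the report above shows -1.

```r
## Pattern from the report: 1 to 3 arbitrary characters, immediately repeated.
pat <- '(.{1,3})\\1'

## Default engine (perl = FALSE): result may depend on the R version / regex library.
m_tre <- regexpr(pat, 'foo')

## PCRE engine (perl = TRUE): backtracks from "foo" down to "o", then "oo" matches.
m_pcre <- regexpr(pat, 'foo', perl = TRUE)

cat("perl = FALSE : start", as.integer(m_tre),
    " length", attr(m_tre, "match.length"), "\n")
cat("perl = TRUE  : start", as.integer(m_pcre),
    " length", attr(m_pcre, "match.length"), "\n")

## Extract the matched substring for the PCRE result.
matched <- regmatches('foo', m_pcre)
print(matched)
```

With perl=TRUE the match starts at position 2 and has length 2 (the substring "oo"), the same position and length that the original report shows for the {1,2} case.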
Using grep() instead of regexpr() makes the behavior easier to parse.
The following code
----------------------------------------------------------------------
tx <- c("ab","abc", paste0("foo", c("", "b", "o", "bar", "oofy")))
setNames(nchar(tx), tx)
## ab abc foo foob fooo foobar foooofy
## 2 3 3 4 4 6 7
grep1r <- function(n, txt, ...) {
    pattern <- paste0('(.{1,',n,'})\\1', collapse="") ## can have empty n
    ans <- grep(pattern, txt, value=TRUE, ...)
    cat(sprintf("pattern '%s' : ", pattern)); print(ans, quote=FALSE)
    invisible(ans)
}
grep1r({}, tx) # '.{1,}' : because of _greedy_ matching there is __no__ repetition!
grep1r(100,tx) # i.e., these both give an empty result : character(0)
## matching at most once:
grep1r(1, tx)# matches all 5 starting with "foo"
grep1r(2, tx)# ditto : all have more than 2 chars
grep1r(3, tx)# not "foo": those with more than 3 chars
grep1r(4, tx)# .. those with more than 4 characters
grep1r(5, tx)# .. those with more than 5 characters
grep1r(6, tx)# .. those with more than 6 characters
grep1r(7, tx)# NONE (= those with more than 7 characters)
for(p in c(FALSE,TRUE)) {
    cat("\ngrep(*, perl =", p, ") :\n")
    for(n in c(list(NULL), 1:7))
        grep1r(n, tx, perl = p)
}
----------------------------------------------------------------------
ends with
> for(p in c(FALSE,TRUE)) {
+ cat("\ngrep(*, perl =", p, ") :\n")
+ for(n in c(list(NULL), 1:7))
+ grep1r(n, tx, perl = p)
+ }
grep(*, perl = FALSE ) :
pattern '(.{1,})\1' : character(0)
pattern '(.{1,1})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,2})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,3})\1' : [1] foob fooo foobar foooofy
pattern '(.{1,4})\1' : [1] foobar foooofy
pattern '(.{1,5})\1' : [1] foobar foooofy
pattern '(.{1,6})\1' : [1] foooofy
pattern '(.{1,7})\1' : character(0)
grep(*, perl = TRUE ) :
pattern '(.{1,})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,1})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,2})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,3})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,4})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,5})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,6})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,7})\1' : [1] foo foob fooo foobar foooofy
>
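As a cross-check that does not rely on any regex engine, the following minimal sketch tests directly whether a string contains a block of 1 to n characters that is immediately repeated, which is what '(.{1,n})\1' should find under the understanding above. The helper has_adjacent_repeat() is hypothetical (it is not part of R), and it assumes plain strings where substr() positions correspond to the characters matched by '.'.

```r
## Hypothetical helper: TRUE if 's' contains some block of k characters
## (1 <= k <= n) that is immediately followed by an identical block.
has_adjacent_repeat <- function(s, n) {
    len <- nchar(s)
    for (k in seq_len(min(n, len %/% 2))) {        # block length
        for (start in seq_len(len - 2 * k + 1)) {  # block start position
            first  <- substr(s, start,     start + k - 1)
            second <- substr(s, start + k, start + 2 * k - 1)
            if (identical(first, second)) return(TRUE)
        }
    }
    FALSE
}

tx <- c("ab","abc", paste0("foo", c("", "b", "o", "bar", "oofy")))

## Strings that should match for upper bound n = 3, by brute force
keep <- vapply(tx, function(s) has_adjacent_repeat(s, 3), logical(1),
               USE.NAMES = FALSE)
expected <- tx[keep]
print(expected, quote = FALSE)

## Compare with the PCRE result for the same bound
pcre <- grep('(.{1,3})\\1', tx, value = TRUE, perl = TRUE)
print(identical(pcre, expected))
```

Every string beginning with "foo" contains "oo", so the brute-force check selects all five of them for any n >= 1, in agreement with the perl = TRUE rows above.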