[Rd] Question about regexp edge case

Fri Aug 9 11:01:59 CEST 2024

On 8/1/24 20:55, Duncan Murdoch wrote:
> Thanks Tomas.  Do note that my original post also mentioned a bug or 
> doc error in the PCRE docs for this regexp:
>
>>   - perl = TRUE does *not* give the documented result on at least one 
>> system (which is "123456789", because "{,5}" is documented to not be 
>> a quantifier, so it should only match the literal string "{,5}").

This is a change in documented behavior in PCRE. PCRE2 10.43 
(share/man/man3/pcre2pattern.3) says:

"If the first number is omitted, the lower limit is taken as zero; in 
this case the upper limit must be present. X{,4} is interpreted as 
X{0,4}. In earlier versions such a sequence was not interpreted as a 
quantifier. Other regular expression engines may behave either way."

And the changelog:

"29. Perl 5.34.0 changed the meaning of (for example) {,3} which did not 
used to be treated as a quantifier. Now it is interpreted as {0,3} and 
PCRE2 has changed to match. Note that {,} is still not a quantifier."

Sadly the previous behavior was also documented in pcre2pattern.3:

"For example, {,6} is not a quantifier, but a literal string of four 
characters"

I've confirmed this with R built against PCRE2 10.42, 10.43 and 10.44. 
In practice, users would most likely see the new behavior on Windows, 
where Rtools44 has PCRE2 10.43.

The R documentation (?regex) refers to the PCRE2 documentation for 
"complete details", mentioning how to find out what is the version of 
PCRE(2) used.  I've now added a warning that PCRE behavior may change 
between versions, with {,m} as an example. I don't think we can do much 
more - I don't think we should be replicating the PCRE 
documentation/changelog - but we could add more examples if any 
important ones come up. Also, we don't want to write R programs that 
depend on specific versions of PCRE.
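
For diagnosing a report like this one, a minimal sketch of how one might 
check which PCRE version an R installation uses. extSoftVersion() is 
base R; the helper pcre_has_open_lower_bound() is a hypothetical name 
introduced here for illustration, and the 10.43 threshold is taken from 
the changelog entry quoted above. This is meant for diagnostics, not as 
something program logic should branch on.

```r
# Hypothetical helper (not part of R): TRUE if the PCRE version string
# is 10.43 or later, where the changelog says "{,m}" became a quantifier.
pcre_has_open_lower_bound <- function(pcre = extSoftVersion()[["PCRE"]]) {
  # extSoftVersion() reports something like "10.42 2022-12-11";
  # keep only the leading version number.
  ver <- sub(" .*$", "", pcre)
  utils::compareVersion(ver, "10.43") >= 0
}

extSoftVersion()[["PCRE"]]     # version string of the PCRE library in use
pcre_has_open_lower_bound()    # result depends on the installation

# The pattern from the original report; the result depends on the PCRE
# version, so no expected output is given here.
gsub("^([0-9]{,5}).*", "\\1", "123456789", perl = TRUE)
```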

It is a good thing that ?regex doesn't document "{,m}", because it 
cannot be used reliably/portably. One should use one of the documented 
forms instead, e.g. "{0,m}". There is still the problem of how to stay 
within the documented subset of behavior (in ?regex), because one also 
needs to avoid accidentally writing undocumented expressions that have a 
special meaning, as in this case. But authors could still try to 
defensively avoid risky expressions in the literal parts of patterns, 
such as those involving "{}" or otherwise resembling documented 
expressions with a special meaning.
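
As a sketch of the defensive style described above, assuming the goal is 
"up to m leading digits": spell out the lower bound, and use 
fixed = TRUE when braces are meant literally. The helper name 
leading_digits() is hypothetical and introduced only for this example.

```r
# Hypothetical helper: keep up to `m` leading digits using the documented
# "{0,m}" form instead of "{,m}".
leading_digits <- function(x, m, perl = FALSE) {
  pattern <- sprintf("^([0-9]{0,%d}).*", as.integer(m))
  sub(pattern, "\\1", x, perl = perl)
}

leading_digits("123456789", 5)               # "12345"
leading_digits("123456789", 5, perl = TRUE)  # "12345"

# When the text "{,5}" itself is what should be matched, fixed = TRUE
# bypasses regular expression interpretation altogether.
grepl("{,5}", "x{,5}", fixed = TRUE)         # TRUE
```

With an explicit lower bound the pattern does not rely on how a given 
engine or version reads "{,m}".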

Best
Tomas

>
> Duncan
>
> On 2024-08-01 6:49 a.m., Tomas Kalibera wrote:
>>
>> On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
>>> On Sun, 28 Jul 2024 20:02:21 -0400
>>> Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>>
>>>> gsub("^([0-9]{,5}).*","\\1","123456789")
>>>> [1] "123456"
>>> This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns {.rm_so
>>> = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and above it
>>> returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
>>>
>>> Compiling with TRE_DEBUG, I see it parsed correctly:
>>>
>>> catenation, sub 0, 0 tags
>>>     assertions: bol
>>>     iteration {-1, 2}, sub -1, 0 tags, greedy
>>>       literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>>
>>> ...but after tre_expand_ast I see
>>>
>>> catenation, sub 0, 1 tags
>>>     assertions: bol
>>>     catenation, sub -1, 1 tags
>>>       tag 0
>>>       union, sub -1, 0 tags
>>>         literal empty
>>>         catenation, sub -1, 0 tags
>>>           literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
>>>           union, sub -1, 0 tags
>>>             literal empty
>>>             catenation, sub -1, 0 tags
>>>               literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
>>>               union, sub -1, 0 tags
>>>                 literal empty
>>>                 literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>>
>>> ...which has one too many copies of "literal (0,9)". I think it's due
>>> to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
>>>
>>> for (j = iter->min; j < iter->max; j++)
>>>
>>> ...where 'min' is -1 to denote no minimum. This is further confirmed by
>>> "{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
>>>
>>> Neither TRE documentation [1] nor POSIX [2] specify the {,n} syntax:
>>> from my reading, it looks like if the upper boundary is specified, the
>>> lower boundary must be specified too. But if we do want to fix this, it
>>> will have to be a special case for iter->min == -1.
>>
>> Thanks. It seems that TRE is now maintained again upstream, so it would
>> be best to discuss this with TRE maintainers directly (if not already
>> solved by https://github.com/laurikari/tre/pull/98).
>>
>> The same applies to any other open TRE issues.
>>
>> Best Tomas
>>
>