[Rd] Question about regexp edge case

Fri Aug 9 11:44:32 CEST 2024

Thanks!  I think your suggested additions to the docs are perfect.

Duncan Murdoch

On 2024-08-09 5:01 a.m., Tomas Kalibera wrote:
> 
> On 8/1/24 20:55, Duncan Murdoch wrote:
>> Thanks Tomas.  Do note that my original post also mentioned a bug or
>> doc error in the PCRE docs for this regexp:
>>
>>>    - perl = TRUE does *not* give the documented result on at least one
>>> system (which is "123456789", because "{,5}" is documented to not be
>>> a quantifier, so it should only match the literal string "{,5}").
> 
> This is a change in documented behavior in PCRE. PCRE2 10.43
> (share/man/man3/pcre2pattern.3) says:
> 
> "If the first number is omitted, the lower limit is taken as zero; in
> this case the upper limit must be present. X{,4} is interpreted as
> X{0,4}. In earlier versions such a sequence was not interpreted as a
> quantifier. Other regular expression engines may behave either way."
> 
> And the changelog:
> 
> "29. Perl 5.34.0 changed the meaning of (for example) {,3} which did not
> used to be treated as a quantifier. Now it is interpreted as {0,3} and
> PCRE2 has changed to match. Note that {,} is still not a quantifier."
> 
> Sadly the previous behavior was also documented in pcre2pattern.3:
> 
> "For example, {,6} is not a quantifier, but a literal string of four
> characters"
> 
> I've confirmed this with R built against PCRE2 10.42, 10.43 and 10.44. In
> practice, users would most likely see the new behavior on Windows, where
> Rtools44 has PCRE2 10.43.
> 
> The R documentation (?regex) refers to the PCRE2 documentation for
> "complete details" and mentions how to find out which version of
> PCRE(2) is in use.  I've now added a warning that PCRE behavior may
> change between versions, with {,m} as an example. I don't think we
> can do much more - I don't think we should be replicating the PCRE
> documentation/changelog - but we could add more examples if any
> important ones appear. Also, we don't want to write R programs that
> depend on concrete versions of PCRE.
> 
> It is a good thing that ?regex doesn't document "{,m}", because it
> cannot be used reliably/portably. One should use one of the documented
> forms instead, i.e. "{0,m}". There is indeed the problem of how to use
> only the documented subset of behavior (in ?regex), because one also
> needs to avoid accidentally running into undocumented expressions with
> special meaning, as in this case. Still, authors could perhaps try
> to defensively avoid risky expressions in literals in patterns, such as
> those involving "{}" or otherwise resembling documented expressions
> with a special meaning.
> 
> Best
> Tomas
> 
> 
>>
>> Duncan
>>
>> On 2024-08-01 6:49 a.m., Tomas Kalibera wrote:
>>>
>>> On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
>>>> On Sun, 28 Jul 2024 20:02:21 -0400
>>>> Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>>>
>>>>> gsub("^([0-9]{,5}).*","\\1","123456789")
>>>>> [1] "123456"
>>>> This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns {.rm_so
>>>> = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and above it
>>>> returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
>>>>
>>>> Compiling with TRE_DEBUG, I see it parsed correctly:
>>>>
>>>> catenation, sub 0, 0 tags
>>>>      assertions: bol
>>>>      iteration {-1, 2}, sub -1, 0 tags, greedy
>>>>        literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>>>
>>>> ...but after tre_expand_ast I see
>>>>
>>>> catenation, sub 0, 1 tags
>>>>      assertions: bol
>>>>      catenation, sub -1, 1 tags
>>>>        tag 0
>>>>        union, sub -1, 0 tags
>>>>          literal empty
>>>>          catenation, sub -1, 0 tags
>>>>            literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
>>>>            union, sub -1, 0 tags
>>>>              literal empty
>>>>              catenation, sub -1, 0 tags
>>>>                literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
>>>>                union, sub -1, 0 tags
>>>>                  literal empty
>>>>                  literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>>>
>>>> ...which has one too many copies of "literal (0,9)". I think it's due
>>>> to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
>>>>
>>>> for (j = iter->min; j < iter->max; j++)
>>>>
>>>> ...where 'min' is -1 to denote no minimum. This is further confirmed by
>>>> "{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
>>>>
>>>> Neither TRE documentation [1] nor POSIX [2] specifies the {,n} syntax:
>>>> from my reading, it looks like if the upper boundary is specified, the
>>>> lower boundary must be specified too. But if we do want to fix this, it
>>>> will have to be a special case for iter->min == -1.
>>>
>>> Thanks. It seems that TRE is now maintained again upstream, so it would
>>> be best to discuss this with TRE maintainers directly (if not already
>>> solved by https://github.com/laurikari/tre/pull/98).
>>>
>>> The same applies to any other open TRE issues.
>>>
>>> Best Tomas
>>>
>>