[Rd] Question about regexp edge case
Tomas Kalibera
tom@@@k@||ber@ @end|ng |rom gm@||@com
Fri Aug 9 11:01:59 CEST 2024
On 8/1/24 20:55, Duncan Murdoch wrote:
> Thanks Tomas. Do note that my original post also mentioned a bug or
> doc error in the PCRE docs for this regexp:
>
>> - perl = TRUE does *not* give the documented result on at least one
>> system (which is "123456789", because "{,5}" is documented to not be
>> a quantifier, so it should only match the literal string "{,5}").
This is a change in documented behavior in PCRE. PCRE2 10.43
(share/man/man3/pcre2pattern.3) says:
"If the first number is omitted, the lower limit is taken as zero; in
this case the upper limit must be present. X{,4} is interpreted as
X{0,4}. In earlier versions such a sequence was not interpreted as a
quantifier. Other regular expression engines may behave either way."
And the changelog:
"29. Perl 5.34.0 changed the meaning of (for example) {,3} which did not
used to be treated as a quantifier. Now it is interpreted as {0,3} and
PCRE2 has changed to match. Note that {,} is still not a quantifier."
Sadly the previous behavior was also documented in pcre2pattern.3:
"For example, {,6} is not a quantifier, but a literal string of four
characters"
I've confirmed this with R built against PCRE2 10.42, 10.43 and 10.44.
In practice, users would most likely see the new behavior on Windows,
where Rtools44 has PCRE2 10.43.
The R documentation (?regex) refers to the PCRE2 documentation for
"complete details" and mentions how to find out which version of
PCRE(2) is in use. I've now added a warning that PCRE behavior may
change between versions, with {,m} as an example. I don't think we
can do much more - I don't think we should be replicating the PCRE
documentation/changelog - but we could add more examples if any
important ones appear. Also, we don't want to write R programs that
depend on specific versions of PCRE.
It is a good thing that ?regex doesn't document "{,m}", because it
cannot be used reliably/portably. One should use one of the documented
forms instead, e.g. "{0,m}". Admittedly, restricting oneself to the
documented subset of behavior (in ?regex) is not easy, because one also
needs to avoid accidentally running into undocumented expressions with
special meaning, as in this case. Still, authors could try to
defensively avoid risky expressions in literals in patterns, such as
those involving "{}" or otherwise resembling documented expressions with
a special meaning.
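
To illustrate the portable form, here is a minimal sketch in C. Note the
assumptions: it uses the C library's POSIX regcomp()/regexec() with
REG_EXTENDED, not the TRE or PCRE2 builds that R links against, and
first_group() is a hypothetical helper written only for this example.
It shows the explicit "{0,5}" bound, which does not rely on how an
engine treats an omitted lower limit.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the first capture group of `pattern`,
 * matched against `subject`, into `out` (NUL-terminated).
 * Returns 0 on success, -1 if the pattern does not compile, does not
 * match, or the group does not fit into `outsz` bytes. */
int first_group(const char *pattern, const char *subject,
                char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first group */
    int rc = -1;

    if (outsz == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    out[0] = '\0';

    if (regexec(&re, subject, 2, m, 0) == 0 && m[1].rm_so >= 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < outsz) {
            memcpy(out, subject + m[1].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```

Calling first_group("^([0-9]{0,5})", "123456789", buf, sizeof buf) with
a 16-byte buf should leave "12345" in buf, i.e. at most five digits,
which is what the {,5} form was presumably meant to express.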
Best
Tomas
>
> Duncan
>
> On 2024-08-01 6:49 a.m., Tomas Kalibera wrote:
>>
>> On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
>>> On Sun, 28 Jul 2024 20:02:21 -0400
>>> Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>>
>>>> gsub("^([0-9]{,5}).*","\\1","123456789")
>>>> [1] "123456"
>>> This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns {.rm_so
>>> = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and above it
>>> returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
>>>
>>> Compiling with TRE_DEBUG, I see it parsed correctly:
>>>
>>> catenation, sub 0, 0 tags
>>> assertions: bol
>>> iteration {-1, 2}, sub -1, 0 tags, greedy
>>> literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>>
>>> ...but after tre_expand_ast I see
>>>
>>> catenation, sub 0, 1 tags
>>> assertions: bol
>>> catenation, sub -1, 1 tags
>>> tag 0
>>> union, sub -1, 0 tags
>>> literal empty
>>> catenation, sub -1, 0 tags
>>> literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
>>> union, sub -1, 0 tags
>>> literal empty
>>> catenation, sub -1, 0 tags
>>> literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
>>> union, sub -1, 0 tags
>>> literal empty
>>> literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>>
>>> ...which has one too many copies of "literal (0,9)". I think it's due
>>> to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
>>>
>>> for (j = iter->min; j < iter->max; j++)
>>>
>>> ...where 'min' is -1 to denote no minimum. This is further confirmed by
>>> "{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
>>>
>>> Neither TRE documentation [1] nor POSIX [2] specify the {,n} syntax:
>>> from my reading, it looks like if the upper boundary is specified, the
>>> lower boundary must be specified too. But if we do want to fix this, it
>>> will have to be a special case for iter->min == -1.
>>
>> Thanks. It seems that TRE is now maintained again upstream, so it would
>> be best to discuss this with TRE maintainers directly (if not already
>> solved by https://github.com/laurikari/tre/pull/98).
>>
>> The same applies to any other open TRE issues.
>>
>> Best Tomas
>>
>