[Rd] Question about regexp edge case

Duncan Murdoch <murdoch.duncan using gmail.com>
Thu Aug 1 20:55:36 CEST 2024


Thanks, Tomas.  Do note that my original post also mentioned what is either
a bug in PCRE or an error in its documentation for this regexp:

>   - perl = TRUE does *not* give the documented result on at least one 
> system (which is "123456789", because "{,5}" is documented to not be a 
> quantifier, so it should only match the literal string "{,5}").

Duncan

On 2024-08-01 6:49 a.m., Tomas Kalibera wrote:
> 
> On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
>> On Sun, 28 Jul 2024 20:02:21 -0400,
>> Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>>
>>> gsub("^([0-9]{,5}).*","\\1","123456789")
>>> [1] "123456"
>> This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns {.rm_so
>> = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and above it
>> returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
>>
>> Compiling with TRE_DEBUG, I see it parsed correctly:
>>
>> catenation, sub 0, 0 tags
>>     assertions: bol
>>     iteration {-1, 2}, sub -1, 0 tags, greedy
>>       literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>
>> ...but after tre_expand_ast I see
>>
>> catenation, sub 0, 1 tags
>>     assertions: bol
>>     catenation, sub -1, 1 tags
>>       tag 0
>>       union, sub -1, 0 tags
>>         literal empty
>>         catenation, sub -1, 0 tags
>>           literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
>>           union, sub -1, 0 tags
>>             literal empty
>>             catenation, sub -1, 0 tags
>>               literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
>>               union, sub -1, 0 tags
>>                 literal empty
>>                 literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>
>> ...which has one too many copies of "literal (0,9)". I think it's due
>> to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
>>
>> for (j = iter->min; j < iter->max; j++)
>>
>> ...where 'min' is -1 to denote no minimum. This is further confirmed by
>> "{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
>>
>> Neither TRE documentation [1] nor POSIX [2] specifies the {,n} syntax:
>> from my reading, it looks like if the upper boundary is specified, the
>> lower boundary must be specified too. But if we do want to fix this, it
>> will have to be a special case for iter->min == -1.
> 
> Thanks. It seems that TRE is now maintained again upstream, so it would
> be best to discuss this with TRE maintainers directly (if not already
> solved by https://github.com/laurikari/tre/pull/98).
> 
> The same applies to any other open TRE issues.
> 
> Best,
> Tomas
>


