[Rd] Question about regexp edge case

Tomas Kalibera
Thu Aug 1 12:49:57 CEST 2024


On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
> On Sun, 28 Jul 2024 20:02:21 -0400
> Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>
>> gsub("^([0-9]{,5}).*","\\1","123456789")
>> [1] "123456"
> This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns {.rm_so
> = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and above it
> returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
>
> Compiling with TRE_DEBUG, I see it parsed correctly:
>
> catenation, sub 0, 0 tags
>    assertions: bol
>    iteration {-1, 2}, sub -1, 0 tags, greedy
>      literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>
> ...but after tre_expand_ast I see
>
> catenation, sub 0, 1 tags
>    assertions: bol
>    catenation, sub -1, 1 tags
>      tag 0
>      union, sub -1, 0 tags
>        literal empty
>        catenation, sub -1, 0 tags
>          literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
>          union, sub -1, 0 tags
>            literal empty
>            catenation, sub -1, 0 tags
>              literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
>              union, sub -1, 0 tags
>                literal empty
>                literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>
> ...which has one too many copies of "literal (0,9)". I think it's due
> to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
>
> for (j = iter->min; j < iter->max; j++)
>
> ...where 'min' is -1 to denote no minimum. This is further confirmed by
> "{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
>
> Neither the TRE documentation [1] nor POSIX [2] specifies the {,n} syntax:
> from my reading, it looks like if the upper boundary is specified, the
> lower boundary must be specified too. But if we do want to fix this, it
> will have to be a special case for iter->min == -1.

Thanks. It seems that TRE is maintained upstream again, so it would
be best to discuss this with the TRE maintainers directly (unless it is
already fixed by https://github.com/laurikari/tre/pull/98).

The same applies to any other open TRE issues.

Best,
Tomas


