[Rd] Question about regexp edge case

Ivan Krylov |kry|ov @end|ng |rom d|@root@org
Mon Jul 29 09:37:57 CEST 2024


On Sun, 28 Jul 2024 20:02:21 -0400
Duncan Murdoch <murdoch.duncan using gmail.com> wrote:

> gsub("^([0-9]{,5}).*","\\1","123456789")  
> [1] "123456"

This is in TRE itself: for "^([0-9]{,1})", tre_regexecb returns
{.rm_so = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" it
returns {.rm_so = 0, .rm_eo = 3}, one character too many, and larger
upper bounds are off by one in the same way.

Compiling with TRE_DEBUG, I see it parsed correctly:

catenation, sub 0, 0 tags
  assertions: bol
  iteration {-1, 2}, sub -1, 0 tags, greedy
    literal (0, 9) (48, 57), pos 0, sub -1, 0 tags

...but after tre_expand_ast I see

catenation, sub 0, 1 tags
  assertions: bol
  catenation, sub -1, 1 tags
    tag 0
    union, sub -1, 0 tags
      literal empty
      catenation, sub -1, 0 tags
        literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
        union, sub -1, 0 tags
          literal empty
          catenation, sub -1, 0 tags
            literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
            union, sub -1, 0 tags
              literal empty
              literal (0, 9) (48, 57), pos 0, sub -1, 0 tags

...which has one too many copies of "literal (0,9)". I think it's due
to the expansion loop on line 942 of src/extra/tre/tre-compile.c being

for (j = iter->min; j < iter->max; j++)

...where 'min' is -1 to denote no minimum. This is further confirmed by
"{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.

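To make the arithmetic concrete, here is a minimal sketch in C. It is
not TRE code: count_expansion_steps is a hypothetical helper that only
mimics the shape of the loop quoted above, and it assumes that each pass
through that loop adds one optional copy of the literal.

```c
/* Hypothetical helper, not part of TRE: counts how many times the body
 * of a loop shaped like "for (j = min; j < max; j++)" would run.
 * Assumption: each pass corresponds to one optional copy of the
 * iterated node in the expanded AST. */
static int count_expansion_steps(int min, int max)
{
    int j;
    int steps = 0;

    for (j = min; j < max; j++)
        steps++;

    return steps;
}
```

With min = -1 and max = 2 this counts 3 passes, while min = 0 and
max = 2 counts 2, which would account for the extra copy in the dump.
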
Neither the TRE documentation [1] nor POSIX [2] specifies the {,n} syntax:
from my reading, it looks like if the upper boundary is specified, the
lower boundary must be specified too. But if we do want to fix this, it
will have to be a special case for iter->min == -1.
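
One possible shape for such a special case, sketched in C under stated
assumptions. This is not a patch against TRE: effective_min and
optional_copies are hypothetical names, the sketch only covers a finite
upper bound, and it assumes one optional copy per loop pass.

```c
/* Hypothetical: map the "no minimum" sentinel (-1) to 0 so that
 * "{,n}" is treated like "{0,n}". */
static int effective_min(int min)
{
    return min < 0 ? 0 : min;
}

/* Hypothetical: number of optional copies produced by a loop of the
 * form "for (j = min; j < max; j++)" once the sentinel is handled.
 * Assumes max is a finite, non-negative bound. */
static int optional_copies(int min, int max)
{
    int j;
    int copies = 0;

    for (j = effective_min(min); j < max; j++)
        copies++;

    return copies;
}
```

Under these assumptions, min = -1 with max = 2 yields 2 optional copies,
the same as "{0,2}", and explicit minimums behave as before.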

-- 
Best regards,
Ivan

[1]
https://laurikari.net/tre/documentation/regex-syntax/

[2]
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap09.html#tag_09_03_06



More information about the R-devel mailing list