[Rd] Question about regexp edge case

Duncan Murdoch murdoch@dunc@n @end|ng |rom gm@||@com
Mon Jul 29 02:02:21 CEST 2024


On StackOverflow (here: 
https://stackoverflow.com/questions/78803652/why-does-gsub-in-r-match-one-character-too-many) 
there was a question about this result:

 > gsub("^([0-9]{,5}).*","\\1","123456789")
[1] "123456"

The OP expected "12345" as the result.  Several points were raised:

  - The R docs don't mention the case of {,5} for the default perl = 
FALSE, which uses TRE.
  - perl = TRUE gives the OP's expected result of "12345".
  - perl = TRUE does *not* give the documented result on at least one 
system. The documented result is "123456789": "{,5}" is documented to 
not be a quantifier, so it should match only the literal string "{,5}", 
the pattern should then fail to match, and the input should come back 
unchanged.
  - Some regexp engines (including Perl and Awk) document that "12345" 
is correct.
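
For anyone who wants to compare engines on their own system, here is a 
minimal sketch. It runs the pattern from the question under both engines 
and adds the explicit {0,5} form as a baseline. The baseline and the 
variable names are my additions for illustration, not part of the 
StackOverflow question. Only the baseline results are annotated, since 
the {,5} results are exactly what varies.

```r
x <- "123456789"

# Baseline (an assumption about what the OP intended): with an explicit
# lower bound, {0,5} is an ordinary quantifier in both engines.
explicit_tre  <- gsub("^([0-9]{0,5}).*", "\\1", x)               # "12345"
explicit_pcre <- gsub("^([0-9]{0,5}).*", "\\1", x, perl = TRUE)  # "12345"

# The case in question: lower bound omitted. The results can depend on
# the engine and on the library version, so none are asserted here.
omitted_tre  <- gsub("^([0-9]{,5}).*", "\\1", x)
omitted_pcre <- gsub("^([0-9]{,5}).*", "\\1", x, perl = TRUE)

print(c(explicit_tre  = explicit_tre,
        explicit_pcre = explicit_pcre,
        omitted_tre   = omitted_tre,
        omitted_pcre  = omitted_pcre))
```

Writing the lower bound explicitly avoids the ambiguity whichever way 
the {,5} case ends up being documented.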

Is any of this worth fixing?

Duncan Murdoch


