[Rd] Bug in PCRE interface code

Toby Hocking tdhock5 @end|ng |rom gm@||@com
Tue Sep 5 23:06:49 CEST 2023


BTW, the format of the name table is documented at
http://pcre.org/current/doc/html/pcre2api.html#infoaboutpattern with a
helpful example, copied below.

As a simple example of the name/number table, consider the following
pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
is set, so white space - including newlines - is ignored):

  (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )

There are four named capture groups, so the table has four entries,
and each entry in the table is eight bytes long. The table is as
follows, with non-printing bytes shown in hexadecimal, and undefined
bytes shown as ??:

  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
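
To make the layout concrete, here is a minimal sketch in C that decodes
the example table above. It does not call PCRE2 at all: example_table,
entry_group_number and entry_group_name are hypothetical names made up
for this illustration, and the undefined (??) bytes are simply filled
with zeros. The one point it is meant to show is that the two leading
bytes are read as unsigned values, most significant byte first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy of the example table: 4 entries of 8 bytes each.
   The undefined bytes (shown as ?? above) are filled with 0 here. */
enum { ENTRY_SIZE = 8, NAME_COUNT = 4 };

static const unsigned char example_table[NAME_COUNT * ENTRY_SIZE] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00,
};

/* Group number of the i-th entry: two bytes, most significant first.
   The bytes are unsigned, so values of 0x80 and above stay positive. */
static int entry_group_number(const unsigned char *table, int entry_size, int i)
{
    assert(table != NULL && entry_size > 2 && i >= 0);
    const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
    return (entry[0] << 8) | entry[1];
}

/* Zero-terminated group name of the i-th entry; it starts right after
   the two number bytes. */
static const char *entry_group_name(const unsigned char *table, int entry_size, int i)
{
    assert(table != NULL && entry_size > 2 && i >= 0);
    return (const char *)(table + (size_t)i * (size_t)entry_size + 2);
}
```

Looping i from 0 to NAME_COUNT - 1 over example_table should give the
pairs (1, date), (5, day), (4, month), (2, year), i.e. the entries
come back sorted by name rather than by group number. In real code
the table pointer, entry size and entry count would come from
pcre2_pattern_info() rather than from a hard-coded array.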

On Mon, Sep 4, 2023 at 3:02 AM Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>
> This Stackoverflow question https://stackoverflow.com/q/77036362 turned
> up a bug in the R PCRE interface.
>
> The example (currently in an edit to the original question) tried to use
> named capture with more than 127 named groups.  Here's the code:
>
> append_unique_id <- function(x) {
>    for (i in seq_along(x)) {
>      x[i] <- paste0("<", paste(sample(letters, 10), collapse = ""), ">", x[i])
>    }
>    x
> }
>
> list_regexes <- sample(letters, 128, TRUE) # <<<<<<<<<<< change this to
>                                             #             127 and it works
> regex2 <- append_unique_id(list_regexes)
> regex2 <- paste0("(?", regex2, ")")
> regex2 <- paste(regex2, collapse = "|")
>
> out <- gregexpr(regex2, "Cyprus", perl = TRUE, ignore.case = TRUE)
> #> Error in gregexpr(regex2, "Cyprus", perl = TRUE, ignore.case = TRUE):
> #>   attempt to set index -129/128 in SET_STRING_ELT
>
> I think the bug is in R, here:
> https://github.com/wch/r-source/blob/57d15d68235dd9bcfaa51fce83aaa71163a020e1/src/main/grep.c#L3079
>
> This is the line
>
>             int capture_num = (entry[0]<<8) + entry[1] - 1;
>
> where entry is declared as a pointer to char.  This extracts a 16-bit
> number from the first two bytes of a name table entry, which are
> followed by the name of the capture group.  Since plain char is signed
> on most platforms, a byte of 0x80 or above is converted to a negative
> integer and the value comes out wrong.
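
For what it's worth, the arithmetic can be reproduced in isolation.
The following is only a sketch, not the code in grep.c: the function
names and the entry_128 array (with a made-up one-letter group name)
are hypothetical. It uses signed char explicitly so that the result
does not depend on whether plain char happens to be signed on a given
platform.

```c
#include <assert.h>

/* Same expression as the quoted line, with the bytes forced to be
   signed (plain char is signed on many, but not all, platforms). */
static int capture_num_signed(const signed char *entry)
{
    assert(entry != 0);
    return (entry[0] << 8) + entry[1] - 1;
}

/* Same expression again, reading the bytes as unsigned. */
static int capture_num_unsigned(const unsigned char *entry)
{
    assert(entry != 0);
    return (entry[0] << 8) + entry[1] - 1;
}

/* Hypothetical name table entry for group number 128 (bytes 00 80),
   followed by a made-up one-letter group name. */
static const unsigned char entry_128[] = { 0x00, 0x80, 'a', 0x00 };
```

With signed bytes, 0x80 is read as -128, so the first function returns
0 + (-128) - 1 = -129, which matches the index in the error message.
With unsigned bytes the second function returns 127, the expected
zero-based index of group 128.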
>
> Duncan Murdoch


