[Rd] Bug in perl=TRUE regexp matching?

Duncan Murdoch
Sun Jul 23 22:29:10 CEST 2023


The help page for `?gsub` says (in the context of performance 
considerations):


"... just one UTF-8 string will force all the matching to be done in 
Unicode"


However, this thread on SO:  https://stackoverflow.com/q/76749529 gives 
some indication that this is not true for `perl = TRUE`.  Specifically:

 > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος", "Jean-François Dupuis")
 > Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
 > regex <- "\\B\\w+| +"
 > gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

 > gsub(regex, "", strings, perl = TRUE)
[1] "85"                  "JS"                  "ΓιάννηςΠαπαδόπουλος" "J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer 
when the regex option /u ("match with full Unicode") is specified, but 
the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but this 
looks like a flag may have been missed in the `perl = TRUE` case.
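
One way to probe that hypothesis, offered as a sketch rather than a diagnosis: PCRE documents a `(*UCP)` prefix that, placed at the very start of a pattern, asks for `\w`, `\B` and friends to use Unicode properties instead of ASCII-only classes. Assuming R's PCRE was built with Unicode property support, prefixing the same regex should bring the `perl = TRUE` result in line with the default one. The names `regex_ucp`, `res_plain` and `res_ucp` are made up for this illustration, and the non-ASCII strings are written with `\u` escapes only so that they are marked UTF-8 regardless of the session locale.

```r
# Same inputs as in the transcript above, spelled with \u escapes so the
# last two elements are marked as UTF-8 in any locale.
strings <- c("89 562", "John Smith",
             "\u0393\u03b9\u03ac\u03bd\u03bd\u03b7\u03c2 \u03a0\u03b1\u03c0\u03b1\u03b4\u03cc\u03c0\u03bf\u03c5\u03bb\u03bf\u03c2",
             "Jean-Fran\u00e7ois Dupuis")

regex <- "\\B\\w+| +"

# Hypothetical variant: the identical pattern with PCRE's (*UCP) prefix.
# Assumption: the PCRE library R is linked against supports Unicode properties.
regex_ucp <- paste0("(*UCP)", regex)

res_plain <- gsub(regex,     "", strings, perl = TRUE)  # \w is ASCII-only here
res_ucp   <- gsub(regex_ucp, "", strings, perl = TRUE)  # \w uses Unicode properties

print(res_plain)
print(res_ucp)
```

If the two printed vectors differ only in the last two elements, that would suggest the UTF-8 inputs do switch PCRE into UTF mode but leave the character classes in their ASCII-only default.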

Duncan Murdoch


