[Rd] Bug in perl=TRUE regexp matching?
Duncan Murdoch
murdoch@dunc@n @end|ng |rom gm@||@com
Sun Jul 23 22:29:10 CEST 2023
The help page for `?gsub` says (in the context of performance
considerations):
"... just one UTF-8 string will force all the matching to be done in
Unicode"
However, this thread on SO: https://stackoverflow.com/q/76749529 gives
some indication that this is not true for `perl = TRUE`. Specifically:
> strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
+              "Jean-François Dupuis")
> Encoding(strings)
[1] "unknown" "unknown" "UTF-8" "UTF-8"
> regex <- "\\B\\w+| +"
> gsub(regex, "", strings)
[1] "85" "JS" "ΓΠ" "J-FD"
> gsub(regex, "", strings, perl = TRUE)
[1] "85" "JS" "ΓιάννηςΠαπαδόπουλος" "J-FçoD"
and the website https://regex101.com/r/QDFrOE/1 gives the first answer
when the regex option /u ("match with full Unicode") is specified, but
the second answer when it is not.
Now I'm not at all sure that that website is authoritative, but this
looks like a flag may have been missed in the `perl = TRUE` case.
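
One way to probe the missing-flag idea is PCRE's `(*UCP)` pattern prefix, which asks PCRE to use Unicode properties for `\w`, `\b` and `\B`. The sketch below assumes that R hands this prefix through to PCRE unchanged under `perl = TRUE`; the names `regex_ucp`, `reference` and `result` are only for illustration.

```r
# Same data as in the example above; enc2utf8() makes sure the non-ASCII
# strings are marked as UTF-8 regardless of how the script was read in.
strings <- enc2utf8(c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
                      "Jean-François Dupuis"))
regex <- "\\B\\w+| +"

# Assumption: a leading (*UCP) switches PCRE to Unicode-aware \w and \B.
regex_ucp <- paste0("(*UCP)", regex)

reference <- gsub(regex, "", strings)                   # default (TRE) engine
result    <- gsub(regex_ucp, "", strings, perl = TRUE)  # PCRE with (*UCP)

# TRUE would support the idea that only the Unicode-property flag is missing.
print(identical(result, reference))
```

If the comparison prints TRUE, the difference between the two engines would come down to Unicode property support rather than to UTF-8 handling as such.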
Duncan Murdoch