[Rd] Potential issue with perl-based pattern matching with Unicode characters on Windows R 4.0 and above

Tue Jun 9 00:09:54 CEST 2020

Hi everyone,

I've noticed a change in the behavior of `regexpr(..., perl = TRUE)` on
Windows with R 4.0 and above when the input contains Unicode characters.
Here's a minimal example where I'd expect a start value of `5` (which is
what R 3.6.2 and below give), but R 4.0.0 (and R 4.0.1) now returns `6`:

> regexpr("b", "foo\U0001F937bar", perl = TRUE)
#> [1] 6
#> attr(,"match.length")
#> [1] 1
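
One way to cross-check the offset is to compare the three matching engines
against the string's character and byte counts. The following is only a
sketch, not output from the session above; the variable names (`x`,
`pos_perl`, and so on) are made up for illustration. Because `\U0001F937`
lies outside the Basic Multilingual Plane, it is one character but four
bytes in UTF-8, so `b` should be character 5 and byte 8. A reported start of
`6` would be consistent with that character being counted as two units
somewhere, though that is an assumption rather than a confirmed diagnosis.

```r
# Illustrative sketch; `x` and the other names are made up for this example.
x <- "foo\U0001F937bar"

# The string is 7 characters (code points) but 10 bytes in UTF-8,
# because U+1F937 needs 4 bytes.
n_chars <- nchar(x, type = "chars")
n_bytes <- nchar(x, type = "bytes")

# Character-based start positions from the three matching engines.
pos_perl  <- as.integer(regexpr("b", x, perl = TRUE))
pos_tre   <- as.integer(regexpr("b", x))
pos_fixed <- as.integer(regexpr("b", x, fixed = TRUE))

# Byte-based start position, for reference.
pos_bytes <- as.integer(regexpr("b", x, fixed = TRUE, useBytes = TRUE))

print(c(chars = n_chars, bytes = n_bytes))
print(c(perl = pos_perl, tre = pos_tre, fixed = pos_fixed, bytes = pos_bytes))
```

If only `pos_perl` differs from `pos_tre` and `pos_fixed`, that would point
at the `perl = TRUE` code path rather than at the string itself.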

Perhaps this change in behavior could be explained by R 4.0's migration to
PCRE2? Here is some relevant output from my R 4.0 session:

> pcre_config()
#> UTF-8 Unicode properties     JIT    stack
#>  TRUE               TRUE    TRUE    FALSE

> extSoftVersion()
#>                      zlib                     bzlib                        xz
#>                  "1.2.11"      "1.0.8, 13-Jul-2019"                   "5.2.4"
#>                      PCRE                       ICU                       TRE
#>        "10.33 2019-04-16"                    "58.2" "TRE 0.8.0 R_fixes (BSD)"
#>                     iconv                  readline                      BLAS
#>               "win_iconv"                        ""                        ""

Let me know if there's any more information I can provide to help replicate
and isolate the issue. Also, if this happens to be the expected behavior,
I'd be keen to learn about why that's the case.

Thank you,


Carson Sievert, PhD
Software Engineer at RStudio
Website <https://cpsievert.me> | Twitter <https://twitter.com/cpsievert> |
GitHub <https://github.com/cpsievert>

