[Rd] Potential issue with perl-based pattern matching with Unicode characters on Windows R 4.0 and above
Carson Sievert
cp@|evert1 @end|ng |rom gm@||@com
Tue Jun 9 00:09:54 CEST 2020
Hi everyone,
I've noticed new behavior in `regexpr(..., perl = TRUE)` on Windows with
R 4.0 and above when the input contains Unicode characters. Here's a minimal
example where I'd expect to see a start value of `5` (as R 3.6.2 and below
gives), but R 4.0.0 (and R 4.0.1) now returns `6`:
```
> regexpr("b", "foo\U0001F937bar", perl = TRUE)
#> [1] 6
#> attr(,"match.length")
#> [1] 1
```
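To help isolate whether the difference is specific to the PCRE engine, a comparison along these lines may be useful. This is only a sketch: the variable names are illustrative, and no output is shown because it may differ by platform and R version.

```r
# Sketch: compare the match position reported by each matching mode.
# Variable names here are illustrative, not from the original report.
x <- "foo\U0001F937bar"

# How R counts the string in characters and in bytes
n_chars <- nchar(x, type = "chars")
n_bytes <- nchar(x, type = "bytes")

# Same pattern, different engines / modes
positions <- c(
  perl       = as.integer(regexpr("b", x, perl = TRUE)),   # PCRE
  tre        = as.integer(regexpr("b", x)),                # default (TRE)
  fixed      = as.integer(regexpr("b", x, fixed = TRUE)),  # literal match
  perl_bytes = as.integer(regexpr("b", x, perl = TRUE, useBytes = TRUE))
)

print(c(chars = n_chars, bytes = n_bytes))
print(positions)
```

If only the `perl` entry disagrees with `tre` and `fixed`, that would point at the PCRE code path rather than at how the string itself is stored.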
Perhaps this change in behavior could be explained by R 4.0's migration to
PCRE2? Here is some relevant output from my R 4.0 session:
```
> pcre_config()
#>              UTF-8 Unicode properties                JIT              stack
#>               TRUE               TRUE               TRUE              FALSE
```
```
> extSoftVersion()
#>                      zlib                     bzlib                        xz
#>                  "1.2.11"      "1.0.8, 13-Jul-2019"                   "5.2.4"
#>                      PCRE                       ICU                       TRE
#>        "10.33 2019-04-16"                    "58.2" "TRE 0.8.0 R_fixes (BSD)"
#>                     iconv                  readline                      BLAS
#>               "win_iconv"                        ""                        ""
```
Let me know if there's any more information I can provide to help replicate
and isolate the issue. Also, if this happens to be the expected behavior,
I'd be keen to learn about why that's the case.
Thank you,
-Carson
--
Carson Sievert, PhD
Software Engineer at RStudio
Website <https://cpsievert.me> | Twitter <https://twitter.com/cpsievert> |
GitHub <https://github.com/cpsievert>