[Rd] Potential issue with perl-based pattern matching with Unicode characters on Windows R 4.0 and above
tom@@@k@||ber@ @end|ng |rom gm@||@com
Tue Jun 9 17:01:06 CEST 2020
Thanks for the report. This is a bug in R, specific to Windows and to
characters that need surrogate pairs. Other characters work fine, and so
do the other current operating systems on which R runs (all those where a
single wchar_t holds a complete Unicode character). Now fixed in R-devel.
If handling of surrogate pairs (e.g. Emoji characters) is important for
you, it would help if you could systematically stress-test R for that. A
number of related bugs have been fixed, but some may well remain, because
these characters rarely appear in test data.
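
A minimal sketch of what such a stress test could look like, assuming that comparing regexpr(..., perl = TRUE) offsets against positions computed directly from code points is a reasonable starting point. The helper check_one() and the particular characters chosen are hypothetical and purely illustrative; they are not part of R:

```r
# Hypothetical stress test: put a varying number of characters above U+FFFF
# (which need surrogate pairs in UTF-16) in front of "bar" and check that the
# reported match position is counted in characters.
set.seed(1)
astral <- c(0x1F937L, 0x1F600L, 0x10348L)  # code points above U+FFFF

check_one <- function(n_before) {
  prefix <- c(utf8ToInt("foo"), sample(astral, n_before, replace = TRUE))
  x <- intToUtf8(c(prefix, utf8ToInt("bar")))
  expected <- length(prefix) + 1L            # position of "b" in characters
  got <- as.integer(regexpr("b", x, perl = TRUE))
  c(expected = expected, got = got)
}

res <- t(sapply(0:5, check_one))
# Rows where the reported position differs from the expected one
mismatches <- res[res[, "expected"] != res[, "got"], , drop = FALSE]
print(mismatches)
```

An empty mismatches matrix means no discrepancy was found for these inputs. The same pattern could be extended to gregexpr(), sub() and other functions that report or use match offsets.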
Also, fixing bugs sometimes ironically introduces new problems. This
regression was caused by a correct fix of a bug related to surrogate
pairs in R 4.0. The bug that was fixed had been cancelling out another old
bug in the post-processing of PCRE results.
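
As an R-level illustration only (not the C code in R that was fixed), the sketch below shows how the position of "b" in the reported string differs depending on whether it is counted in Unicode characters or in UTF-16 code units, where a character above U+FFFF takes two units (a surrogate pair). That the incorrect result of 6 corresponds to a count in UTF-16 units is an assumption that is consistent with the report, not something verified against the R sources:

```r
x <- "foo\U0001F937bar"

cp <- utf8ToInt(x)                        # one integer per Unicode code point
units <- ifelse(cp > 0xFFFF, 2L, 1L)      # UTF-16 code units per code point

b_idx <- which(cp == utf8ToInt("b"))[1]   # index of "b" among the code points

pos_chars <- b_idx                                  # 5: counted in characters
pos_utf16 <- sum(units[seq_len(b_idx - 1L)]) + 1L   # 6: counted in UTF-16 units

print(c(chars = pos_chars, utf16 = pos_utf16))
```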
On 6/9/20 12:09 AM, Carson Sievert wrote:
> Hi everyone,
> I've noticed new behavior in `regexpr(..., perl = TRUE)` on Windows with
> R 4.0 and above with Unicode characters. Here's a minimal example where I'd
> expect to see a start value of `5` (as R 3.6.2 and below gives), but R
> 4.0.0 (and R 4.0.1) now returns:
> > regexpr("b", "foo\U0001F937bar", perl = TRUE)
> #> [1] 6
> #> attr(,"match.length")
> #> [1] 1
> Perhaps this change in behavior could be explained by R 4.0's migration to
> PCRE2? Here is some relevant output from my R 4.0 session:
> #> UTF-8 Unicode properties  JIT stack
> #>  TRUE               TRUE TRUE FALSE
> #>     zlib                bzlib      xz               PCRE
> #> "1.2.11" "1.0.8, 13-Jul-2019" "5.2.4" "10.33 2019-04-16"
> #>    ICU                       TRE       iconv readline BLAS
> #> "58.2" "TRE 0.8.0 R_fixes (BSD)" "win_iconv"       ""   ""
> Let me know if there's any more information I can provide to help replicate
> and isolate the issue. Also, if this happens to be the expected behavior,
> I'd be keen to learn about why that's the case.
> Thank you,