[Rd] R 3.5.3 and 3.6.0 alpha Windows bug: UTF-8 characters in code are simplified to wrong ones

Tomáš Bořil bor||t @end|ng |rom gm@||@com
Wed Apr 10 10:22:04 CEST 2019


Hello,

There is a long-standing problem with processing UTF-8 source code in R
on Windows. Because Windows does not have a "UTF-8" locale and R passes
source code through the OS before executing it, some characters are
"simplified" by the OS before processing, leading to undesirable
changes.

Minimal example:
Let's type "ř" (LATIN SMALL LETTER R WITH CARON) in the RGui console:
> "ř"
[1] "r"

Let's assume the following script:
# file [script.R]
if ("ř" != "\U00159") {
    stop("Problem: Unexpected character conversion.")
} else {
    cat("o.k.\n")
}

Problem:
source("script.R", encoding = "UTF-8")

OK (see https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding):
eval(parse("script.R", encoding = "UTF-8"))
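
The workaround above can be wrapped in a small helper. This is only a
minimal sketch: `source_utf8` is a hypothetical name, not part of base R,
and it covers only what the eval(parse()) call itself does (no echo or
chdir handling as in source()).

```r
# Hypothetical helper (not part of base R): parse the file as UTF-8 and
# evaluate the resulting expressions, instead of calling source().
source_utf8 <- function(file, envir = parent.frame()) {
    exprs <- parse(file, encoding = "UTF-8")
    # eval() on an expression vector evaluates each expression in turn
    invisible(eval(exprs, envir = envir))
}
```

Usage would then be source_utf8("script.R") in place of
source("script.R", encoding = "UTF-8").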

Although the script is in UTF-8, the characters are uncontrollably
replaced by "simplified" substitutes (depending on the OS locale). The
same happens when the statements are simply typed into the R console.
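
To see which character a literal actually became, comparing code points
is more reliable than looking at printed output. A minimal sketch;
`codepoints` is a hypothetical helper name, built only on the base
functions utf8ToInt() and enc2utf8():

```r
# Hypothetical helper: show the Unicode code points of a single string
# in "U+XXXX" form.
codepoints <- function(x) {
    sprintf("U+%04X", utf8ToInt(enc2utf8(x)))
}

codepoints("\u0159")  # escape form of the intended character: "U+0159"
codepoints("r")       # the "simplified" substitute: "U+0072"
```

Calling codepoints() on the literal as typed or sourced shows whether it
was kept as U+0159 or replaced by U+0072.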

The problem does not occur on operating systems with a UTF-8 locale
(macOS, Linux, ...).
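
Whether a session runs in a UTF-8 locale can be checked from within R.
A minimal sketch using the base function l10n_info(); the messages are
illustrative only:

```r
# l10n_info() is base R; its "UTF-8" element is TRUE in a UTF-8 locale.
info <- l10n_info()
if (isTRUE(info[["UTF-8"]])) {
    cat("UTF-8 locale: literals should survive unchanged\n")
} else {
    cat("Non-UTF-8 locale (", Sys.getlocale("LC_CTYPE"),
        "): non-native characters may be substituted\n", sep = "")
}
```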

Best regards
Tomas Boril

> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status         alpha
major          3
minor          6.0
year           2019
month          04
day            07
svn rev        76333
language       R
version.string R version 3.6.0 alpha (2019-04-07 r76333)
nickname

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"


