[Rd] special latin1 do not print as glyphs in current devel on windows

Thu Sep 14 09:40:24 CEST 2017

This is a follow-up on my initial posts regarding character encodings on 
Windows (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) 
and Patrick Perry's reply 
(https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in 
particular (thank you for the links and the bug report!). My initial 
posts were quite chaotic (and partly wrong), so I am trying to clear 
things up a bit.

Actually, the title of my original message, "special latin1 [characters] 
do not print as glyphs in current devel on windows", is already wrong, 
because the problem concerns CP1252-encoded characters in the 80-9F 
(hex) range. As Brian Ripley rightly pointed out, latin1 != CP1252. The 
characters that CP1252 places in the 80-9F range are not part of 
ISO/IEC 8859-1 (a.k.a. latin1) at all; see, for example, 
https://en.wikipedia.org/wiki/Windows-1252. R nevertheless treats them 
as if they were, and that is exactly the problem, IMHO.

Let me show you what I mean. All output is from R-devel (3.5.0, 
r73238); see sessionInfo() at the end.

 > Sys.getlocale("LC_CTYPE")
[1] "German_Germany.1252"
 > x <- c("€", "ž", "š", "ü")
 > sapply(x, charToRaw)
\u0080 \u009e \u009a      ü
    80     9e     9a     fc

"€", "ž", and "š" serve as examples from the 80-9F range of CP1252. I 
also include "ü" as an example of a non-ASCII character outside that 
range, and because Patrick Perry used it in his bug report. That report 
might describe a (slightly) different problem, but I will get to that 
later.

 > print(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

"€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€", for 
example, should be \u20ac, not \u0080.
In R 3.4.1, print(x) shows the glyphs and not the Unicode escapes. 
Apparently, as of v3.5, print() calls enc2utf8() or its equivalent in C 
(translateCharUTF8?).
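
To make the wrong escape concrete, here is a minimal sketch that decodes 
the single byte 0x80 once as latin1 and once as CP1252. It builds the 
string from raw bytes, so it does not depend on the session locale. It 
assumes that the iconv() implementation in use knows the encoding names 
"latin1" and "CP1252"; with such an implementation the first result 
should be U+0080 and the second U+20AC.

```r
# The single byte 0x80, built from raw so the script's own encoding is irrelevant
euro_byte <- rawToChar(as.raw(0x80))

# Read as latin1 (ISO 8859-1), 0x80 maps to the C1 control code point U+0080
as_latin1 <- iconv(euro_byte, from = "latin1", to = "UTF-8")

# Read as CP1252, 0x80 maps to the euro sign, U+20AC
as_cp1252 <- iconv(euro_byte, from = "CP1252", to = "UTF-8")

# Show the resulting code points in U+XXXX notation
sprintf("U+%04X", utf8ToInt(as_latin1))
sprintf("U+%04X", utf8ToInt(as_cp1252))
```

The first conversion corresponds to what a string marked "latin1" goes 
through, which would explain the \u0080 escape in the output above.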

 > print("\u20ac")
[1] "€"

All elements of x are marked as "latin1".

 > Encoding(x)
[1] "latin1" "latin1" "latin1" "latin1"

Looking at the CP1252 table (e.g. link above), we see that this is 
incorrect for "€", "ž", and "š", which simply do not exist in latin1.

As per the documentation, "enc2utf8 convert[s] elements of character 
vectors to [...] UTF-8 [...], taking any marked encoding into account." 
Since the marked encoding is wrong, so is the output of enc2utf8().

 > enc2utf8(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

Now, when we set the encoding to "unknown", everything works fine.

 > x_un <- x
 > Encoding(x_un) <- "unknown"
 > print(x_un)
[1] "€" "ž" "š" "ü"
 > (x_un2utf8 <- enc2utf8(x_un))
[1] "€" "ž" "š" "ü"

Long story short: The characters in the 80 to 9F range should not be 
marked as "latin1" on CP1252 locales, IMHO.
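
Until the marking is changed, one possible workaround is to convert the 
bytes explicitly from CP1252 instead of relying on the mark. The sketch 
below is only an illustration: the helper name cp1252_to_utf8 is 
hypothetical, the input is assumed to really hold CP1252 bytes, and it 
relies on my understanding that iconv() converts the underlying bytes 
according to `from` without consulting the declared encoding.

```r
# Hypothetical helper (not part of base R): convert the bytes of each
# element from CP1252 to UTF-8. Assumption: iconv() works on the bytes,
# so a misleading "latin1" mark on the input is not consulted.
cp1252_to_utf8 <- function(x) {
  iconv(x, from = "CP1252", to = "UTF-8")
}

# Build the four example strings from raw bytes, so that the encoding of
# this script does not matter, and mimic the "latin1" mark seen above.
bytes <- c(0x80, 0x9e, 0x9a, 0xfc)
x <- vapply(bytes, function(b) rawToChar(as.raw(b)), character(1))
Encoding(x) <- "latin1"

y <- cp1252_to_utf8(x)

# The results are marked as UTF-8
Encoding(y)

# Code points of the converted strings in U+XXXX notation
sprintf("U+%04X", vapply(y, utf8ToInt, integer(1)))
```

If the assumptions hold, the code points are those of "€", "ž", "š", and 
"ü" (U+20AC, U+017E, U+0161, U+00FC) rather than the C1 control range.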

As a side note, the output of localeToCharset() is also problematic: it 
reports ISO8859-1 for a CP1252 locale, and ISO8859-1 != CP1252.

 > localeToCharset()
[1] "ISO8859-1"

Finally on to Patrick Perry's bug report 
(https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On 
Windows, enc2utf8("ü") yields "|".'

Unfortunately, I cannot reproduce this with the CP1252 locale, as can be 
seen above. This is probably because the bug applies to the C locale 
(sorry if this is apparent somewhere in the bug report and I missed it).

 > Sys.setlocale("LC_CTYPE", "C")
[1] "C"
 > enc2utf8("ü")
[1] "|"
 > charToRaw("ü")
[1] fc
 > Encoding("ü")
[1] "unknown"

This does not seem to be related to the marked encoding of the string 
(which is "unknown" here), so it seems to me that this is a different 
problem from the one above.

Any advice on how to proceed further would be highly appreciated.

Thanks!
Daniel

 > sessionInfo()
R Under development (unstable) (2017-09-11 r73238)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.0