[Rd] special latin1 do not print as glyphs in current devel on windows
Daniel Possenriede
possenriede at gmail.com
Thu Sep 14 09:40:24 CEST 2017
This is a follow-up on my initial posts regarding character encodings on
Windows (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html)
and Patrick Perry's reply
(https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in
particular (thank you for the links and the bug report!). My initial
posts were quite chaotic (and partly wrong), so I am trying to clear
things up a bit.
Actually, the title of my original message, "special latin1 [characters]
do not print as glyphs in current devel on windows", is already wrong:
the problem affects characters in the 80-9F (hex) range of CP1252. As
Brian Ripley rightly pointed out, latin1 != CP1252. The 80-9F range
holds no printable characters in ISO/IEC 8859-1 a.k.a. latin1, see for
example https://en.wikipedia.org/wiki/Windows-1252. R treats the CP1252
characters in that range as if they were latin1, however, and that is
exactly the problem, IMHO.
Let me show you what I mean. (All output is from R-devel 3.5.0 r73238;
see sessionInfo() at the end.)
> Sys.getlocale("LC_CTYPE")
[1] "German_Germany.1252"
> x <- c("€", "ž", "š", "ü")
> sapply(x, charToRaw)
\u0080 \u009e \u009a      ü
    80     9e     9a     fc
"€", "ž", and "š" serve as examples from the 80-9F range of CP1252. I
also include "ü" as an example of a non-ASCII character outside that
range, and because Patrick Perry used it in his bug report, which might
be a (slightly) different problem; I will get to that later.
> print(x)
[1] "\u0080" "\u009e" "\u009a" "ü"
"€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€", for
example, should be \u20ac, not \u0080.
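To make the difference concrete without depending on the session locale, here is a minimal sketch that decodes the same four bytes once as latin1 and once as CP1252 with iconv(). The variable names are mine, and it assumes the platform's iconv recognises the encoding name "CP1252".

```r
# The four bytes from the example, built from raw values so the result does
# not depend on the session locale or on how this message is encoded.
bytes <- as.raw(c(0x80, 0x9e, 0x9a, 0xfc))
chars <- rawToChar(bytes, multiple = TRUE)  # one single-byte string per element

# Decode the same bytes under both interpretations.
as_latin1 <- iconv(chars, from = "latin1", to = "UTF-8")
as_cp1252 <- iconv(chars, from = "CP1252", to = "UTF-8")

# Look at the resulting Unicode code points.
cp_latin1 <- vapply(as_latin1, utf8ToInt, integer(1), USE.NAMES = FALSE)
cp_cp1252 <- vapply(as_cp1252, utf8ToInt, integer(1), USE.NAMES = FALSE)

print(sprintf("U+%04X", cp_latin1))
print(sprintf("U+%04X", cp_cp1252))
```

Read as latin1, byte 80 should come out as U+0080, which matches the escape that print() shows; read as CP1252 it should come out as U+20AC, the euro sign. Only "ü" (fc) is the same under both interpretations.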
(In R 3.4.1, print(x) shows the glyphs and not the Unicode escapes.
Apparently, as of R 3.5, print() calls enc2utf8() or its C-level
equivalent, perhaps translateCharUTF8?)
> print("\u20ac")
[1] "€"
The characters in x are marked as "latin1".
> Encoding(x)
[1] "latin1" "latin1" "latin1" "latin1"
Looking at the CP1252 table (e.g. link above), we see that this is
incorrect for "€", "ž", and "š", which simply do not exist in latin1.
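As a rough way to spot such elements, one could check the raw bytes of each string for values in the 80-9F range. This is only a sketch: has_80_9f_bytes() is a hypothetical helper of mine, not part of R, and the check is only meaningful for strings in a single-byte encoding such as latin1 or CP1252 (in UTF-8 the same byte values occur as continuation bytes).

```r
# Hypothetical helper: TRUE for elements containing at least one byte in 80-9F.
# Assumes the strings are in a single-byte encoding (latin1 / CP1252).
has_80_9f_bytes <- function(x) {
  vapply(x, function(s) {
    b <- as.integer(charToRaw(s))
    any(b >= 0x80 & b <= 0x9f)
  }, logical(1), USE.NAMES = FALSE)
}

# Same bytes as in the example, built from raw values and marked "latin1"
# to mimic what Encoding(x) reports.
x_bytes <- rawToChar(as.raw(c(0x80, 0x9e, 0x9a, 0xfc)), multiple = TRUE)
Encoding(x_bytes) <- "latin1"

flags <- has_80_9f_bytes(x_bytes)
print(flags)
```

For the four example bytes, the first three elements are flagged and "ü" (fc) is not.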
As per the documentation, "enc2utf8 convert[s] elements of character
vectors to [...] UTF-8 [...], taking any marked encoding into account."
Since the marked encoding is wrong, so is the output of enc2utf8().
> enc2utf8(x)
[1] "\u0080" "\u009e" "\u009a" "ü"
Now, when we set the encoding to "unknown" everything works fine.
> x_un <- x
> Encoding(x_un) <- "unknown"
> print(x_un)
[1] "€" "ž" "š" "ü"
> (x_un2utf8 <- enc2utf8(x_un))
[1] "€" "ž" "š" "ü"
Long story short: The characters in the 80 to 9F range should not be
marked as "latin1" on CP1252 locales, IMHO.
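Until the marking is sorted out, a possible user-level workaround is to drop the "latin1" mark and decode the bytes explicitly as CP1252. The sketch below is not a proposed fix for R itself; cp1252_to_utf8() is a hypothetical helper, and it assumes the platform's iconv recognises "CP1252".

```r
# Same bytes as in the example, marked "latin1" to mimic Encoding(x).
x_bytes <- rawToChar(as.raw(c(0x80, 0x9e, 0x9a, 0xfc)), multiple = TRUE)
Encoding(x_bytes) <- "latin1"

# Hypothetical helper: remove the mark, then decode the bytes as CP1252.
cp1252_to_utf8 <- function(x) {
  Encoding(x) <- "unknown"                  # operate on a local copy only
  iconv(x, from = "CP1252", to = "UTF-8")   # result is marked as UTF-8
}

x_utf8 <- cp1252_to_utf8(x_bytes)
print(Encoding(x_utf8))
print(vapply(x_utf8, utf8ToInt, integer(1), USE.NAMES = FALSE))
```

The resulting strings are marked "UTF-8" and carry the code points one would expect from the CP1252 table (U+20AC for "€", and so on), so printing them should no longer depend on the wrong latin1 mark.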
As a side-note: the output of localeToCharset() is also problematic,
since ISO8859-1 != CP1252.
> localeToCharset()
[1] "ISO8859-1"
Finally, on to Patrick Perry's bug report
(https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On
Windows, enc2utf8("ü") yields "|".'
Unfortunately, I cannot reproduce this with the CP1252 locale, as can be
seen above, probably because the bug applies to the C locale (sorry if
this is apparent somewhere in the bug report and I missed it).
> Sys.setlocale("LC_CTYPE", "C")
[1] "C"
> enc2utf8("ü")
[1] "|"
> charToRaw("ü")
[1] fc
> Encoding("ü")
[1] "unknown"
This does not seem to be related to the marked encoding of the string,
so it seems to me that this is a different problem from the one above.
Any advice on how to proceed further would be highly appreciated.
Thanks!
Daniel
> sessionInfo()
R Under development (unstable) (2017-09-11 r73238)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=C
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0