[Rd] special latin1 do not print as glyphs in current devel on windows
Patrick Perry
pperry at stern.nyu.edu
Thu Sep 14 13:47:33 CEST 2017
This particular issue has a simple fix. Currently, the "R_check_locale"
function includes the following code starting at line 244 in
src/main/platform.c:
#ifdef Win32
    {
        char *ctype = setlocale(LC_CTYPE, NULL), *p;
        p = strrchr(ctype, '.');
        if (p && isdigit(p[1])) localeCP = atoi(p+1); else localeCP = 0;
        /* Not 100% correct, but CP1252 is a superset */
        known_to_be_latin1 = latin1locale = (localeCP == 1252);
    }
#endif
The "1252" should be "28591", the Windows code page identifier for
ISO 8859-1 (latin1); see
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx
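
To make the proposed change concrete, here is a minimal standalone sketch
of the same parsing logic outside of R. The function names
codepage_from_locale and is_latin1_codepage are hypothetical, not R
internals, and the sketch only assumes that the locale name ends in
".<digits>", as in "German_Germany.1252".

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of R): extract the numeric code page
   from a Windows locale name such as "German_Germany.1252".
   Returns 0 when the name has no ".<digits>" suffix, e.g. "C". */
int codepage_from_locale(const char *ctype)
{
    const char *p = ctype ? strrchr(ctype, '.') : NULL;
    if (p && isdigit((unsigned char) p[1])) return atoi(p + 1);
    return 0;
}

/* Hypothetical helper: treat only code page 28591 (ISO 8859-1) as
   latin1. Code page 1252 differs from latin1 in the 0x80-0x9F range. */
int is_latin1_codepage(int cp)
{
    return cp == 28591;
}
```

With this check, codepage_from_locale("German_Germany.1252") gives 1252
and is_latin1_codepage(1252) is 0, so a CP1252 locale would no longer be
treated as latin1.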
> Daniel Possenriede <mailto:possenriede at gmail.com>
> September 14, 2017 at 3:40 AM
> This is a follow-up on my initial posts regarding character encodings
> on Windows
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and
> Patrick Perry's reply
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in
> particular (thank you for the links and the bug report!). My initial
> posts were quite chaotic (and partly wrong), so I am trying to clear
> things up a bit.
>
> Actually, the title of my original message "special latin1
> [characters] do not print as glyphs in current devel on windows" is
> already wrong, because the problem concerns characters in the 80-9F
> (hex) range of the CP1252 encoding. As Brian Ripley rightly
> pointed out, latin1 != CP1252. The characters in the 80-9F code point
> range are not even part of ISO/IEC 8859-1 a.k.a. latin1, see for
> example https://en.wikipedia.org/wiki/Windows-1252. R treats them as
> if they were, however, and that is exactly the problem, IMHO.
>
> Let me show you what I mean. (All output from R 3.5 r73238, see
> sessionInfo at the end)
>
> > Sys.getlocale("LC_CTYPE")
> [1] "German_Germany.1252"
> > x <- c("€", "ž", "š", "ü")
> > sapply(x, charToRaw)
> \u0080 \u009e \u009a      ü
>     80     9e     9a     fc
>
> "€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also
> show the "ü" just as an example of a non-ASCII character outside that
> range (and because Patrick Perry used it in his bug report which might
> be a (slightly) different problem, but I will get to that later.)
>
> > print(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> "€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€", for
> example, should be \u20ac, not \u0080.
> (In R 3.4.1, print(x) shows the glyphs and not the Unicode escapes.
> Apparently, as of v3.5, print() calls enc2utf8() or its C equivalent,
> perhaps translateCharUTF8?)
>
> > print("\u20ac")
> [1] "€"
>
> The characters in x are marked as "latin1".
>
> > Encoding(x)
> [1] "latin1" "latin1" "latin1" "latin1"
>
> Looking at the CP1252 table (e.g. link above), we see that this is
> incorrect for "€", "ž", and "š", which simply do not exist in latin1.
>
> As per the documentation, "enc2utf8 convert[s] elements of character
> vectors to [...] UTF-8 [...], taking any marked encoding into
> account." Since the marked encoding is wrong, so is the output of
> enc2utf8().
>
> > enc2utf8(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> Now, when we set the encoding to "unknown" everything works fine.
>
> > x_un <- x
> > Encoding(x_un) <- "unknown"
> > print(x_un)
> [1] "€" "ž" "š" "ü"
> > (x_un2utf8 <- enc2utf8(x_un))
> [1] "€" "ž" "š" "ü"
>
> Long story short: The characters in the 80 to 9F range should not be
> marked as "latin1" on CP1252 locales, IMHO.
>
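
To illustrate the difference for these bytes, here is a small sketch in C.
It is not R's actual conversion code, and the names cp1252_high,
latin1_to_unicode and cp1252_to_unicode are made up for this example. It
maps a single byte to a Unicode code point under latin1, where every byte
maps to the code point of the same value, and under CP1252, where the
bytes 80-9F map elsewhere.

```c
#include <stdint.h>

/* Unicode code points for the CP1252 bytes 0x80..0x9F.
   0 marks the five bytes that have no assigned character in CP1252. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* latin1 (ISO 8859-1): each byte is its own code point. */
uint32_t latin1_to_unicode(unsigned char b)
{
    return b;
}

/* CP1252: identical to latin1 except in the 0x80..0x9F range. */
uint32_t cp1252_to_unicode(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F) return cp1252_high[b - 0x80];
    return b;
}
```

For the bytes in the example, 80 gives U+20AC under CP1252 but U+0080
under latin1, while fc gives U+00FC under both, which is consistent with
"ü" printing correctly and "€" not.
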
> As a side-note: the output of localeToCharset() is also problematic,
> since ISO8859-1 != CP1252.
>
> > localeToCharset()
> [1] "ISO8859-1"
>
> Finally on to Patrick Perry's bug report
> (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On
> Windows, enc2utf8("ü") yields "|".'
>
> Unfortunately, I cannot reproduce this with the CP1252 locale, as can
> be seen above. Probably this is because the bug applies to the C locale
> (sorry if this is apparent somewhere in the bug report and I missed it).
>
> > Sys.setlocale("LC_CTYPE", "C")
> [1] "C"
> > enc2utf8("ü")
> [1] "|"
> > charToRaw("ü")
> [1] fc
> > Encoding("ü")
> [1] "unknown"
>
> This does not seem to be related to the marked encoding of the string,
> so it seems to me that this is a different problem than the one above.
>
> Any advice on how to proceed further would be highly appreciated.
>
> Thanks!
> Daniel
>
> > sessionInfo()
> R Under development (unstable) (2017-09-11 r73238)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 14393)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=C
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.5.0
>