[Rd] special latin1 do not print as glyphs in current devel on windows

Thu Sep 14 13:47:33 CEST 2017

This particular issue has a simple fix. Currently, the "R_check_locale" 
function includes the following code starting at line 244 in 
src/main/platform.c:

#ifdef Win32
     {
     char *ctype = setlocale(LC_CTYPE, NULL), *p;
     p = strrchr(ctype, '.');
     if (p && isdigit(p[1])) localeCP = atoi(p+1); else localeCP = 0;
     /* Not 100% correct, but CP1252 is a superset */
     known_to_be_latin1 = latin1locale = (localeCP == 1252);
     }
#endif

The "1252" should be "28591"; see 
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx 
.

> Daniel Possenriede <mailto:possenriede at gmail.com>
> September 14, 2017 at 3:40 AM
> This is a follow-up on my initial posts regarding character encodings 
> on Windows 
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and 
> Patrick Perry's reply 
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in 
> particular (thank you for the links and the bug report!). My initial 
> posts were quite chaotic (and partly wrong), so I am trying to clear 
> things up a bit.
>
> Actually, the title of my original message "special latin1 
> [characters] do not print as glyphs in current devel on windows" is 
> already wrong, because the problem exists with characters with CP1252 
> encoding in the 80-9F (hex) range. Like Brian Ripley rightfully 
> pointed out, latin1 != CP1252. The characters in the 80-9F code point 
> range are not even part of ISO/IEC 8859-1 a.k.a. latin1, see for 
> example https://en.wikipedia.org/wiki/Windows-1252. R treats them as 
> if they were, however, and that is exactly the problem, IMHO.
>
> Let me show you what I mean. (All output from R 3.5 r73238, see 
> sessionInfo at the end)
>
> > Sys.getlocale("LC_CTYPE")
> [1] "German_Germany.1252"
> > x <- c("€", "ž", "š", "ü")
> > sapply(x, charToRaw)
> \u0080 \u009e \u009a  ü
> 80 9e 9a fc
>
> "€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also 
> show the "ü" just as an example of a non-ASCII character outside that 
> range (and because Patrick Perry used it in his bug report which might 
> be a (slightly) different problem, but I will get to that later.)
>
> > print(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> "€", "ž", and "š" are printed as (incorrect) unicode escapes. "€" for 
> example should be \u20ac not \u0080.
> (In R 3.4.1, print(x) shows the glyphs and not the unicode escapes. 
> Apparently, as of v3.5, print() calls enc2utf8() (or its equivalent in 
> C (translateCharUTF8?))?)
>
> > print("\u20ac")
> [1] "€"
>
> The characters in x are marked as "latin1".
>
> > Encoding(x)
> [1] "latin1" "latin1" "latin1" "latin1"
>
> Looking at the CP1252 table (e.g. link above), we see that this is 
> incorrect for "€", "ž", and "š", which simply do not exist in latin1.
>
> As per the documentation, "enc2utf8 convert[s] elements of character 
> vectors to [...] UTF-8 [...], taking any marked encoding into 
> account." Since the marked encoding is wrong, so is the output of 
> enc2utf8().
>
> > enc2utf8(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> Now, when we set the encoding to "unknown" everything works fine.
>
> > x_un <- x
> > Encoding(x_un) <- "unknown"
> > print(x_un)
> [1] "€" "ž" "š" "ü"
> > (x_un2utf8 <- enc2utf8(x_un))
> [1] "€" "ž" "š" "ü"
>
> Long story short: The characters in the 80 to 9F range should not be 
> marked as "latin1" on CP1252 locales, IMHO.
>
> As a side-note: the output of localeToCharset() is also problematic, 
> since ISO8859-1 != CP1252.
>
> > localeToCharset()
> [1] "ISO8859-1"
>
> Finally on to Patrick Perry's bug report 
> (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On 
> Windows, enc2utf8("ü") yields "|".'
>
> Unfortunately, I cannot reproduce this with the CP1252 locale, as can 
> be seen above. Probably, because the bug applies to the C locale 
> (sorry if this is somewhere apparent in the bug report and I missed it).
>
> > Sys.setlocale("LC_CTYPE", "C")
> [1] "C"
> > enc2utf8("ü")
> [1] "|"
> > charToRaw("ü")
> [1] fc
> > Encoding("ü")
> [1] "unknown"
>
> This does not seem to be related to the marked encoding of the string, 
> so it seems to me that this is a different problem than the one above.
>
> Any advice on how to proceed further would be highly appreciated.
>
> Thanks!
> Daniel
>
> > sessionInfo()
> R Under development (unstable) (2017-09-11 r73238)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 14393)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.5.0
>

	[[alternative HTML version deleted]]