[Rd] special latin1 do not print as glyphs in current devel on windows

Sun Nov 12 21:34:21 CET 2017

I am following up on this because the associated bug report was just
closed (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 ):
my original bug report was incomplete and did not include
sessionInfo() or LC_CTYPE.

Admittedly, my original bug report was a little confused. I have since
gained a better understanding of the issue. I want to (a) confirm that
this is a real bug in base R, not RStudio, and (b) provide more context.
It looks like the real issue is that R marks native strings as "latin1"
when the declared character locale is Windows-1252. This causes problems
when converting to UTF-8. See Daniel Possenriede's email below for much
more detail, including his sessionInfo() and a reproducible example.
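
To make the conversion problem concrete, here is a minimal sketch in C
(not R's actual code) of how the same native byte maps to different
code points, and hence different UTF-8, depending on whether it is read
as Latin-1 or as Windows-1252. The function names are made up for
illustration, and mapping the five bytes that CP1252 leaves undefined
to the C1 control of the same value is an assumption of this sketch.

```c
#include <stdio.h>

/* CP1252 code points for bytes 0x80..0x9F. Bytes 0x81, 0x8D, 0x8F,
   0x90 and 0x9D are undefined in CP1252; this sketch (an assumption)
   maps them to the C1 control with the same value. */
static const unsigned cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Latin-1: every byte is its own code point. */
static unsigned latin1_to_codepoint(unsigned char b)
{
    return b;
}

/* CP1252: identical to Latin-1 except for the 0x80..0x9F range. */
static unsigned cp1252_to_codepoint(unsigned char b)
{
    return (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : b;
}

/* Encode a code point below U+10000 as UTF-8; returns the byte count. */
static int utf8_encode(unsigned cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Print the UTF-8 bytes for one native byte under both readings. */
static void show_byte(unsigned char b)
{
    unsigned char buf[3];
    int i, n;

    n = utf8_encode(latin1_to_codepoint(b), buf);
    printf("%02x as latin1:", b);
    for (i = 0; i < n; i++) printf(" %02x", buf[i]);

    n = utf8_encode(cp1252_to_codepoint(b), buf);
    printf("   as CP1252:");
    for (i = 0; i < n; i++) printf(" %02x", buf[i]);
    printf("\n");
}
```

Calling show_byte() on the bytes 80, 9e, 9a and fc from Daniel's
example shows the difference: read as Latin-1, byte 80 becomes U+0080,
whereas read as CP1252 it becomes U+20AC, the euro sign. Byte fc maps
to U+00FC either way, which is consistent with "ü" being unaffected.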

The development version of the `stringi` package and the CRAN version of
the `utf8` package both have workarounds for this bug. (See, e.g.,
https://github.com/gagolews/stringi/issues/287 and the related issues
linked from it.)

Patrick

> Patrick Perry <mailto:pperry at stern.nyu.edu>
> September 14, 2017 at 7:47 AM
> This particular issue has a simple fix. Currently, the 
> "R_check_locale" function includes the following code starting at line 
> 244 in src/main/platform.c:
>
> #ifdef Win32
>     {
>     char *ctype = setlocale(LC_CTYPE, NULL), *p;
>     p = strrchr(ctype, '.');
>     if (p && isdigit(p[1])) localeCP = atoi(p+1); else localeCP = 0;
>     /* Not 100% correct, but CP1252 is a superset */
>     known_to_be_latin1 = latin1locale = (localeCP == 1252);
>     }
> #endif
>
> The "1252" should be "28591"; see 
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx 
> .
>
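
For illustration, the locale check quoted above can be pulled out into
a small standalone C sketch. The helper names locale_codepage,
is_latin1_current and is_latin1_proposed are hypothetical and do not
exist in R's sources; the parsing mirrors the quoted code, with an
unsigned char cast added for isdigit.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Extract the numeric code page from a Windows locale name such as
   "German_Germany.1252"; returns 0 when there is no numeric suffix. */
static int locale_codepage(const char *ctype)
{
    const char *p = strrchr(ctype, '.');
    if (p && isdigit((unsigned char)p[1]))
        return atoi(p + 1);
    return 0;
}

/* The check as quoted: code page 1252 is treated as Latin-1. */
static int is_latin1_current(int codepage)
{
    return codepage == 1252;
}

/* The suggested change: only code page 28591 (ISO 8859-1) counts. */
static int is_latin1_proposed(int codepage)
{
    return codepage == 28591;
}
```

With the locale "German_Germany.1252", locale_codepage() returns 1252,
so the quoted check marks the locale as Latin-1 while the suggested
check would not.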
>
> Daniel Possenriede <mailto:possenriede at gmail.com>
> September 14, 2017 at 3:40 AM
> This is a follow-up on my initial posts regarding character encodings 
> on Windows 
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and 
> Patrick Perry's reply 
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in 
> particular (thank you for the links and the bug report!). My initial 
> posts were quite chaotic (and partly wrong), so I am trying to clear 
> things up a bit.
>
> Actually, the title of my original message "special latin1
> [characters] do not print as glyphs in current devel on windows" is
> already wrong, because the problem affects the characters that CP1252
> encodes in the 80-9F (hex) range. As Brian Ripley rightly pointed
> out, latin1 != CP1252. The characters in the 80-9F code point range
> are not even part of ISO/IEC 8859-1 a.k.a. latin1, see for example
> https://en.wikipedia.org/wiki/Windows-1252. R treats them as if they
> were, however, and that is exactly the problem, IMHO.
>
> Let me show you what I mean. (All output from R 3.5 r73238, see 
> sessionInfo at the end)
>
> > Sys.getlocale("LC_CTYPE")
> [1] "German_Germany.1252"
> > x <- c("€", "ž", "š", "ü")
> > sapply(x, charToRaw)
> \u0080 \u009e \u009a      ü
>     80     9e     9a     fc
>
> "€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also 
> show the "ü" just as an example of a non-ASCII character outside that 
> range (and because Patrick Perry used it in his bug report which might 
> be a (slightly) different problem, but I will get to that later.)
>
> > print(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> "€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€", for
> example, should be \u20ac, not \u0080.
> (In R 3.4.1, print(x) shows the glyphs rather than the Unicode
> escapes. Apparently, as of v3.5, print() calls enc2utf8() or its
> equivalent in C, perhaps translateCharUTF8?)
>
> > print("\u20ac")
> [1] "€"
>
> The characters in x are marked as "latin1".
>
> > Encoding(x)
> [1] "latin1" "latin1" "latin1" "latin1"
>
> Looking at the CP1252 table (e.g. link above), we see that this is 
> incorrect for "€", "ž", and "š", which simply do not exist in latin1.
>
> As per the documentation, "enc2utf8 convert[s] elements of character 
> vectors to [...] UTF-8 [...], taking any marked encoding into 
> account." Since the marked encoding is wrong, so is the output of 
> enc2utf8().
>
> > enc2utf8(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> Now, when we set the encoding to "unknown" everything works fine.
>
> > x_un <- x
> > Encoding(x_un) <- "unknown"
> > print(x_un)
> [1] "€" "ž" "š" "ü"
> > (x_un2utf8 <- enc2utf8(x_un))
> [1] "€" "ž" "š" "ü"
>
> Long story short: The characters in the 80 to 9F range should not be 
> marked as "latin1" on CP1252 locales, IMHO.
>
> As a side-note: the output of localeToCharset() is also problematic, 
> since ISO8859-1 != CP1252.
>
> > localeToCharset()
> [1] "ISO8859-1"
>
> Finally on to Patrick Perry's bug report 
> (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On 
> Windows, enc2utf8("ü") yields "|".'
>
> Unfortunately, I cannot reproduce this with the CP1252 locale, as can
> be seen above. This is probably because the bug applies to the C
> locale (sorry if this is apparent somewhere in the bug report and I
> missed it).
>
> > Sys.setlocale("LC_CTYPE", "C")
> [1] "C"
> > enc2utf8("ü")
> [1] "|"
> > charToRaw("ü")
> [1] fc
> > Encoding("ü")
> [1] "unknown"
>
> This does not seem to be related to the marked encoding of the string, 
> so it seems to me that this is a different problem than the one above.
>
> Any advice on how to proceed further would be highly appreciated.
>
> Thanks!
> Daniel
>
> > sessionInfo()
> R Under development (unstable) (2017-09-11 r73238)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 14393)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.5.0
>
> Patrick Perry <mailto:pperry at stern.nyu.edu>
> August 27, 2017 at 11:40 AM
> Regarding the Windows character encoding issues Daniel Possenriede 
> posted about earlier this month, where non-Latin-1 strings were 
> getting marked as such 
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074731.html ):
>
> The issue is that on Windows, when the character locale is 
> Windows-1252, R marks some (possibly all) native non-ASCII strings as 
> "latin1". I posted a related bug report: 
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 . The bug 
> report also includes a link to a fix for a related issue: converting 
> strings from Windows native to UTF-8.
>
> There is a work-around for this bug in the current development version 
> of the 'corpus' package (not on CRAN yet). See 
> https://github.com/patperry/r-corpus/issues/5 . I have tested this on 
> a Windows-1252 install of R, but I have not tested it on a Windows 
> install in another locale. It'd be great if someone with such an 
> install would test the fix and report back, either here or on the 
> github issue.
>
>
> Patrick
