[Rd] special latin1 do not print as glyphs in current devel on windows
Patrick Perry
pperry at stern.nyu.edu
Sun Nov 12 21:34:21 CET 2017
Just following up on this since the associated bug report just got
closed (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 )
because my original bug report was incomplete, and did not include
sessionInfo() or LC_CTYPE.
Admittedly, my original bug report was a little confused. I have since
gained a better understanding of the issue. I want to (a) confirm that
this is a real bug in base R, not RStudio, and (b) provide more context. It
looks like the real issue is that R marks native strings as "latin1"
when the declared character locale is Windows-1252. This causes problems
when converting to UTF-8. See Daniel Possenriede's email below for much
more detail, including his sessionInfo() and a reproducible example.
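To illustrate why the "latin1" mark matters, here is a minimal C sketch of the two interpretations of the same byte. It is not taken from R's sources: the helper names (latin1_to_codepoint, cp1252_to_codepoint, utf8_encode) and the table name are hypothetical, and the five bytes that Windows-1252 leaves undefined are simply passed through unchanged here.

```c
#include <assert.h>

/* Windows-1252 code points for bytes 0x80..0x9F. The undefined bytes
   (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through unchanged in this
   sketch; that choice is an assumption, not something R does. */
static const unsigned int cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Hypothetical helper: ISO 8859-1 maps every byte to the code point
   with the same numeric value. */
static unsigned int latin1_to_codepoint(unsigned char b)
{
    return b;
}

/* Hypothetical helper: Windows-1252 differs from ISO 8859-1 only in
   the 0x80..0x9F range. */
static unsigned int cp1252_to_codepoint(unsigned char b)
{
    return (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : b;
}

/* Encode a code point below U+10000 as UTF-8 into out (at least 3
   bytes); returns the number of bytes written. */
static int utf8_encode(unsigned int cp, unsigned char *out)
{
    assert(cp < 0x10000);
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
}
```

Read as Windows-1252, byte 0x80 becomes U+20AC (UTF-8 bytes E2 82 AC), the euro sign. Read as latin1, the same byte becomes U+0080 (UTF-8 bytes C2 80), which is the "\u0080" escape in Daniel's output below. Bytes outside 0x80..0x9F, such as 0xFC, give the same code point either way.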
The development version of the `stringi` package and the CRAN version of
the `utf8` package both have workarounds for this bug. (See, e.g.
https://github.com/gagolews/stringi/issues/287 and the links to the
related issues).
Patrick
> Patrick Perry <mailto:pperry at stern.nyu.edu>
> September 14, 2017 at 7:47 AM
> This particular issue has a simple fix. Currently, the
> "R_check_locale" function includes the following code starting at line
> 244 in src/main/platform.c:
>
> #ifdef Win32
> {
> char *ctype = setlocale(LC_CTYPE, NULL), *p;
> p = strrchr(ctype, '.');
> if (p && isdigit(p[1])) localeCP = atoi(p+1); else localeCP = 0;
> /* Not 100% correct, but CP1252 is a superset */
> known_to_be_latin1 = latin1locale = (localeCP == 1252);
> }
> #endif
>
> The "1252" should be "28591"; see
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx .
>
>
> Daniel Possenriede <mailto:possenriede at gmail.com>
> September 14, 2017 at 3:40 AM
> This is a follow-up on my initial posts regarding character encodings
> on Windows
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) and
> Patrick Perry's reply
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in
> particular (thank you for the links and the bug report!). My initial
> posts were quite chaotic (and partly wrong), so I am trying to clear
> things up a bit.
>
> Actually, the title of my original message "special latin1
> [characters] do not print as glyphs in current devel on windows" is
> already wrong, because the problem exists with characters with CP1252
> encoding in the 80-9F (hex) range. As Brian Ripley rightly
> pointed out, latin1 != CP1252. The characters in the 80-9F code point
> range are not even part of ISO/IEC 8859-1 a.k.a. latin1, see for
> example https://en.wikipedia.org/wiki/Windows-1252. R treats them as
> if they were, however, and that is exactly the problem, IMHO.
>
> Let me show you what I mean. (All output from R 3.5 r73238, see
> sessionInfo at the end)
>
> > Sys.getlocale("LC_CTYPE")
> [1] "German_Germany.1252"
> > x <- c("€", "ž", "š", "ü")
> > sapply(x, charToRaw)
> \u0080 \u009e \u009a ü
> 80 9e 9a fc
>
> "€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also
> show the "ü" just as an example of a non-ASCII character outside that
> range (and because Patrick Perry used it in his bug report which might
> be a (slightly) different problem, but I will get to that later.)
>
> > print(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> "€", "ž", and "š" are printed as (incorrect) Unicode escapes. "€", for
> example, should be \u20ac, not \u0080.
> (In R 3.4.1, print(x) shows the glyphs and not the Unicode escapes.
> Apparently, as of v3.5, print() calls enc2utf8() or its equivalent in
> C, possibly translateCharUTF8.)
>
> > print("\u20ac")
> [1] "€"
>
> The characters in x are marked as "latin1".
>
> > Encoding(x)
> [1] "latin1" "latin1" "latin1" "latin1"
>
> Looking at the CP1252 table (e.g. link above), we see that this is
> incorrect for "€", "ž", and "š", which simply do not exist in latin1.
>
> As per the documentation, "enc2utf8 convert[s] elements of character
> vectors to [...] UTF-8 [...], taking any marked encoding into
> account." Since the marked encoding is wrong, so is the output of
> enc2utf8().
>
> > enc2utf8(x)
> [1] "\u0080" "\u009e" "\u009a" "ü"
>
> Now, when we set the encoding to "unknown" everything works fine.
>
> > x_un <- x
> > Encoding(x_un) <- "unknown"
> > print(x_un)
> [1] "€" "ž" "š" "ü"
> > (x_un2utf8 <- enc2utf8(x_un))
> [1] "€" "ž" "š" "ü"
>
> Long story short: The characters in the 80 to 9F range should not be
> marked as "latin1" on CP1252 locales, IMHO.
>
> As a side-note: the output of localeToCharset() is also problematic,
> since ISO8859-1 != CP1252.
>
> > localeToCharset()
> [1] "ISO8859-1"
>
> Finally on to Patrick Perry's bug report
> (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On
> Windows, enc2utf8("ü") yields "|".'
>
> Unfortunately, I cannot reproduce this with the CP1252 locale, as can
> be seen above. This is probably because the bug applies to the C locale
> (sorry if this is apparent somewhere in the bug report and I missed it).
>
> > Sys.setlocale("LC_CTYPE", "C")
> [1] "C"
> > enc2utf8("ü")
> [1] "|"
> > charToRaw("ü")
> [1] fc
> > Encoding("ü")
> [1] "unknown"
>
> This does not seem to be related to the marked encoding of the string,
> so it seems to me that this is a different problem than the one above.
>
> Any advice on how to proceed further would be highly appreciated.
>
> Thanks!
> Daniel
>
> > sessionInfo()
> R Under development (unstable) (2017-09-11 r73238)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 14393)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=C
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
> [5] LC_TIME=German_Germany.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.5.0
>
> Patrick Perry <mailto:pperry at stern.nyu.edu>
> August 27, 2017 at 11:40 AM
> Regarding the Windows character encoding issues Daniel Possenriede
> posted about earlier this month, where non-Latin-1 strings were
> getting marked as such
> (https://stat.ethz.ch/pipermail/r-devel/2017-August/074731.html ):
>
> The issue is that on Windows, when the character locale is
> Windows-1252, R marks some (possibly all) native non-ASCII strings as
> "latin1". I posted a related bug report:
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329 . The bug
> report also includes a link to a fix for a related issue: converting
> strings from Windows native to UTF-8.
>
> There is a work-around for this bug in the current development version
> of the 'corpus' package (not on CRAN yet). See
> https://github.com/patperry/r-corpus/issues/5 . I have tested this on
> a Windows-1252 install of R, but I have not tested it on a Windows
> install in another locale. It'd be great if someone with such an
> install would test the fix and report back, either here or on the
> github issue.
>
>
> Patrick