[Rd] localeToCharset()
Simon Urbanek
simon.urbanek using R-project.org
Mon Jan 31 13:16:03 CET 2022
Andreas,
The output is very predictable, so this is not about predictability. Note that C.UTF-8 is technically an invalid locale by the semantic rules (see below). Also note that the C locale is "C": no string is allowed after the C (otherwise it is not the C locale), so what you have is NOT a C locale (see POSIX 7.2).
The issue here is that the POSIX standard provides no semantic rules: locale names can be arbitrary, and the only defined one is C (and its synonym POSIX). All others are arbitrary locales that can do whatever they want. Later, some systems introduced semantic guidelines such as the <language>_<territory>.<codeset> convention, and that is what localeToCharset() expects so it can try to guess the charset for that language. Since C.UTF-8 is such an aberration (not in the standard form), localeToCharset() doesn't know about it and returns NA since it can't guess the language.
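To make the convention concrete, here is a minimal sketch of splitting a locale name into its parts. The helper split_locale() is hypothetical (it is not part of R or of localeToCharset()), and it assumes the name follows the <language>_<territory>.<codeset> form with an optional @modifier:

```r
# Hypothetical helper, not part of R: split a locale name according to the
# common <language>_<territory>.<codeset>[@modifier] convention.
split_locale <- function(x) {
    x <- sub("@.*$", "", x)                       # drop an optional @modifier
    has_codeset <- grepl(".", x, fixed = TRUE)
    codeset <- if (has_codeset) sub("^[^.]*\\.", "", x) else NA_character_
    name <- sub("\\..*$", "", x)                  # part before the codeset
    has_territory <- grepl("_", name, fixed = TRUE)
    list(
        language  = sub("_.*$", "", name),
        territory = if (has_territory) sub("^[^_]*_", "", name) else NA_character_,
        codeset   = codeset
    )
}

str(split_locale("en_US.UTF-8"))  # language, territory and codeset all present
str(split_locale("C.UTF-8"))      # a codeset, but "C" is not a language code
```

For C.UTF-8 the split yields a codeset but nothing a language-based guess could work with, which is the situation described above.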
Long story short, C.UTF-8 breaks all common rules and has been introduced fairly recently on some Linux systems, so R does not know about it yet. Ivan's patch fixes that. That aside, locale names have no official provision to provide the charset, so all you get is a guess assuming the system follows the common rules.
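As a rough illustration of "take the encoding hint first, then guess from the language" (the restructuring Ivan mentions in the quoted message), here is a sketch. guess_charset() and its two-entry language table are hypothetical and only for illustration; this is not how utils::localeToCharset() is implemented:

```r
# Hypothetical sketch, not R's implementation: prefer the codeset hint in the
# locale name, then add a language-based guess if the language is recognised.
guess_charset <- function(locale) {
    if (locale %in% c("C", "POSIX")) return("ASCII")
    # Tiny illustrative table; a real guess would cover many more languages.
    by_language <- c(en = "ISO8859-1", ru = "ISO8859-5")
    parts <- strsplit(sub("@.*$", "", locale), ".", fixed = TRUE)[[1L]]
    language <- sub("_.*$", "", parts[1L])
    hint <- if (length(parts) >= 2L) parts[2L] else character()
    # Normalise common spellings of UTF-8 ("utf8", "UTF-8", "utf-8").
    if (length(hint) && tolower(gsub("-", "", hint)) == "utf8") hint <- "UTF-8"
    guessed <- unname(by_language[language])
    out <- unique(c(hint, guessed[!is.na(guessed)]))
    if (length(out)) out else NA_character_
}

guess_charset("en_US.UTF-8")  # hint plus language-based fallback
guess_charset("C.UTF-8")      # hint only; "C" is not a known language
guess_charset("ru_RU")        # no hint, so only the language-based guess
```

The result is still only a guess: a locale named ru_RU that actually uses KOI8-R would be reported as ISO8859-5 here as well.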
Cheers,
Simon
> On Feb 1, 2022, at 00:38, Blätte, Andreas <andreas.blaette using uni-due.de> wrote:
>
> Dear Ivan,
>
> this is a very helpful explanation! I think it is important to make output of localeToCharset() more predictable. My problem is essentially not to set the locale such that things will work after all. I think the problem is that you see unexpected results. I guess I owe a suggestion how to improve the code, but your suggestion looks like a very good starting point.
>
> Andreas
>
> On 31.01.22, 12:32, "Ivan Krylov" <krylov.r00t using gmail.com> wrote:
>
> On Mon, 31 Jan 2022 09:56:27 +0000
> "Blätte, Andreas" <andreas.blaette using uni-due.de> wrote:
>
>> After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
>> R`), the output of `localeToCharset()` is:
>> [1] "UTF-8" "ISO8859-1"
>
>> why ISO8859-1 might be a fallback option here?
>
> ISO8859-1 seems to be offered because it covers the alphabet of
> American English. Obviously, this doesn't guarantee that the guess is
> correct. For example, I could symlink the ru_RU.KOI8-R locale on my
> system to name it "ru_RU", and localeToCharset() would return
> "ISO8859-5", not knowing the correct answer. їЯавЯг, anyone?
>
>> Part of my analysis of the code of `localeToCharset()` is that it
>> targets special scenarios on Windows and macOS, but not on Linux.
>
> Well, it almost does the right thing. GNU/Linux locales are typically
> named like <language>_<country>.<encoding>, and localeToCharset()
> respects the <encoding> part, but only if the language and the country
> are specified. A quick fix for that would be to add one final case:
>
> Index: src/library/utils/R/iconv.R
> ===================================================================
> --- src/library/utils/R/iconv.R (revision 81596)
> +++ src/library/utils/R/iconv.R (working copy)
> @@ -135,6 +135,7 @@
> if(enc == "utf8") return(c("UTF-8", guess(ll)))
> else return(guess(ll))
> }
> + if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
> return(NA_character_)
> }
> }
>
> (Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc)
> && enc != "utf8") branch.)
>
> Maybe a better fix would be to restructure the code a bit, to always
> take the encoding hint and then also try to guess if the locale looks
> like it provides a language code.
>
> --
> Best regards,
> Ivan
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel