[Rd] localeToCharset()

Mon Jan 31 12:38:45 CET 2022

Dear Ivan,

this is a very helpful explanation! I think it is important to make the output of localeToCharset() more predictable. My problem is not really how to set the locale so that things work in the end; the problem is rather that one sees unexpected results. I guess I owe a suggestion for how to improve the code, but your suggestion looks like a very good starting point.

Andreas 

On 31.01.22, 12:32, "Ivan Krylov" <krylov.r00t using gmail.com> wrote:

    On Mon, 31 Jan 2022 09:56:27 +0000
    "Blätte, Andreas" <andreas.blaette using uni-due.de> wrote:

    > After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
    > R`), the output of `localeToCharset()` is:
    > [1] "UTF-8"     "ISO8859-1"

    > why ISO8859-1 might be a fallback option here?

    ISO8859-1 seems to be offered because it covers the alphabet of
    American English. Obviously, this doesn't guarantee that the guess is
    correct. For example, I could symlink the ru_RU.KOI8-R locale on my
    system to name it "ru_RU", and localeToCharset() would return
    "ISO8859-5", not knowing the correct answer. їЯавЯг, anyone?

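To illustrate why a wrong guess matters, here is a minimal sketch that decodes the same bytes once as KOI8-R and once as ISO8859-5. The byte values are the KOI8-R encoding of the Russian word for "hello"; the sketch assumes the platform's iconv knows the encoding names "KOI8-R" and "ISO-8859-5", which is typical on GNU/Linux but not guaranteed everywhere.

```r
# Six bytes: the Russian word for "hello" as encoded in KOI8-R.
koi8_bytes <- as.raw(c(0xD0, 0xD2, 0xC9, 0xD7, 0xC5, 0xD4))
x <- rawToChar(koi8_bytes)

# Decoding with the correct charset recovers the word ...
right <- iconv(x, from = "KOI8-R", to = "UTF-8")
# ... while decoding with the guessed charset yields different Cyrillic
# letters, i.e. readable-looking but meaningless text.
wrong <- iconv(x, from = "ISO-8859-5", to = "UTF-8")

print(right)
print(wrong)
```

Both conversions succeed without any warning, which is why a wrong charset guess can go unnoticed.
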
    > Part of my analysis of the code of `localeToCharset()` is that it
    > targets special scenarios on Windows and macOS, but not on Linux.

    Well, it almost does the right thing. GNU/Linux locales are typically
    named like <language>_<country>.<encoding>, and localeToCharset()
    respects the <encoding> part, but only if the language and the country
    are specified. A quick fix for that would be to add one final case:

    Index: src/library/utils/R/iconv.R
    ===================================================================
    --- src/library/utils/R/iconv.R (revision 81596)
    +++ src/library/utils/R/iconv.R (working copy)
    @@ -135,6 +135,7 @@
                 if(enc == "utf8") return(c("UTF-8", guess(ll)))
                 else return(guess(ll))
             }
    +        if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
             return(NA_character_)
         }
     }

    (Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc)
    && enc != "utf8") branch.)

    Maybe a better fix would be to restructure the code a bit, to always
    take the encoding hint and then also try to guess if the locale looks
    like it provides a language code.
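
The restructuring described above could look roughly like the following sketch. It is not the code in src/library/utils/R/iconv.R: the function names `locale_to_charset_sketch()` and `guess_from_language()` are hypothetical, the language table is a tiny illustrative subset, and the Windows and macOS branches as well as the normalisation of non-UTF-8 encoding names are left out.

```r
# Hypothetical helper: map a two-letter language code to a likely charset.
# Only a small illustrative subset of languages is listed here.
guess_from_language <- function(ll) {
    table <- c(en = "ISO8859-1", de = "ISO8859-1",
               pl = "ISO8859-2", ru = "ISO8859-5")
    if (ll %in% names(table)) unname(table[ll]) else NA_character_
}

# Hypothetical, POSIX-only sketch: take the encoding hint first,
# then add a language-based guess if the name looks like ll_CC.
locale_to_charset_sketch <- function(locale) {
    if (locale %in% c("C", "POSIX")) return("ASCII")

    parts <- strsplit(locale, ".", fixed = TRUE)[[1L]]
    lang_part <- if (length(parts)) parts[1L] else ""
    # Encoding hint: the part after the dot, without any @modifier.
    enc <- if (length(parts) == 2L) sub("@.*$", "", parts[2L]) else ""
    is_utf8 <- toupper(enc) %in% c("UTF-8", "UTF8")

    # An explicit non-UTF-8 encoding is returned as given; the real
    # function would normalise it through a lookup table (omitted here).
    if (nzchar(enc) && !is_utf8) return(enc)

    hint <- if (is_utf8) "UTF-8" else NULL
    guessed <- NULL
    if (grepl("^[[:alpha:]]{2}_", lang_part))
        guessed <- guess_from_language(substr(lang_part, 1L, 2L))

    out <- c(hint, guessed)
    out <- out[!is.na(out)]
    if (length(out)) out else NA_character_
}

print(locale_to_charset_sketch("en_US.UTF-8"))
print(locale_to_charset_sketch("C.UTF-8"))
print(locale_to_charset_sketch("ru_RU"))
```

With this ordering, a name such as "C.UTF-8" still yields "UTF-8" even though no language code can be extracted from it, while "ru_RU" continues to fall back on the language-based guess.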

    -- 
    Best regards,
    Ivan