[Rd] localeToCharset()

Mon Jan 31 14:08:11 CET 2022

Dear Tomas,

thanks a lot. I understand Simon's explanation; I was not aware of the standardization issue. My conclusion is that I should rely on another approach to detect the session charset, and your suggestions are my first option.
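
A minimal sketch of what such an alternative could look like, assuming the goal is only to convert text to or from the session encoding. It uses l10n_info() and iconv() with to = "" (the encoding of the current locale) instead of parsing the locale name; the variable names are just for illustration:

```r
# l10n_info() reports properties of the current locale.
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

# A UTF-8 string containing a non-ASCII character.
x <- "Bl\u00e4tte"

# Convert to the native encoding without naming it:
# to = "" stands for the encoding of the current locale.
# If the character cannot be represented there, iconv() returns NA.
native <- iconv(x, from = "UTF-8", to = "")

# Convert back to UTF-8 for further processing.
back <- enc2utf8(native)
```

In a UTF-8 session the round trip should return the input unchanged; in a locale that cannot represent the character, the result is NA rather than a wrong guess.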

My final thought: for users who do not know the POSIX standards and recent aberrations, a warning might be helpful, something such as:
if (startsWith(locale, "C.")) warning(sprintf("%s is a non-standard locale", locale))
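
Spelled out as a small function, this could look as follows. It is only a sketch: warn_nonstandard_locale() is a hypothetical helper name, and taking the locale from Sys.getlocale("LC_CTYPE") by default is my assumption about where the check would be applied:

```r
# Hypothetical helper: warn when a locale name looks like "C.<encoding>".
# By default it inspects the LC_CTYPE category of the running session.
warn_nonstandard_locale <- function(locale = Sys.getlocale("LC_CTYPE")) {
  if (startsWith(locale, "C.")) {
    warning(sprintf("%s is a non-standard locale", locale), call. = FALSE)
  }
  # Return the locale name invisibly so the call can be used in a pipeline.
  invisible(locale)
}

# Usage: warns for "C.UTF-8", stays silent for "en_US.UTF-8".
# warn_nonstandard_locale("C.UTF-8")
# warn_nonstandard_locale("en_US.UTF-8")
```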

As far as I am concerned, I take away a lot from this discussion! Thank you!

Kind regards
Andreas 

On 31.01.22, 13:32, "Tomas Kalibera" <tomas.kalibera using gmail.com> wrote:

    Hi Andreas,

    is there still any higher-level problem left you need to solve? Ideally 
    one wouldn't need to query what is the native encoding, but directly use 
    iconv() or indirectly other R functions to convert the data from/to the 
    native encoding. iconv() will find out internally what is the native 
    encoding (via data that is available also by l10n_info(), but with care 
    for differences between OSes).

    Best
    Tomas

    On 1/31/22 12:38, Blätte, Andreas wrote:
    > Dear Ivan,
    >
    > this is a very helpful explanation! I think it is important to make the output of localeToCharset() more predictable. My problem is not really how to set the locale so that things work; the problem is that one sees unexpected results. I guess I owe a suggestion for how to improve the code, but your suggestion looks like a very good starting point.
    >
    > Andreas
    >
    > On 31.01.22, 12:32, "Ivan Krylov" <krylov.r00t using gmail.com> wrote:
    >
    >      On Mon, 31 Jan 2022 09:56:27 +0000
    >      "Blätte, Andreas" <andreas.blaette using uni-due.de> wrote:
    >
    >      > After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
    >      > R`), the output of `localeToCharset()` is:
    >      > [1] "UTF-8"     "ISO8859-1"
    >
    >      > why ISO8859-1 might be a fallback option here?
    >
    >      ISO8859-1 seems to be offered because it covers the alphabet of
    >      American English. Obviously, this doesn't guarantee that the guess is
    >      correct. For example, I could symlink the ru_RU.KOI8-R locale on my
    >      system to name it "ru_RU", and localeToCharset() would return
    >      "ISO8859-5", not knowing the correct answer. їЯавЯг, anyone?
    >
    >      > Part of my analysis of the code of `localeToCharset()` is that it
    >      > targets special scenarios on Windows and macOS, but not on Linux.
    >
    >      Well, it almost does the right thing. GNU/Linux locales are typically
    >      named like <language>_<country>.<encoding>, and localeToCharset()
    >      respects the <encoding> part, but only if the language and the country
    >      are specified. A quick fix for that would be to add one final case:
    >
    >      Index: src/library/utils/R/iconv.R
    >      ===================================================================
    >      --- src/library/utils/R/iconv.R (revision 81596)
    >      +++ src/library/utils/R/iconv.R (working copy)
    >      @@ -135,6 +135,7 @@
    >                   if(enc == "utf8") return(c("UTF-8", guess(ll)))
    >                   else return(guess(ll))
    >               }
    >      +        if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
    >               return(NA_character_)
    >           }
    >       }
    >
    >      (Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc)
    >      && enc != "utf8") branch.)
    >
    >      Maybe a better fix would be to restructure the code a bit, to always
    >      take the encoding hint and then also try to guess if the locale looks
    >      like it provides a language code.
    >
    >      --
    >      Best regards,
    >      Ivan
    >
    > ______________________________________________
    > R-devel using r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel