[Rd] localeToCharset()
Tomas Kalibera
tomas.kalibera using gmail.com
Thu Feb 10 17:54:25 CET 2022
Thanks to Ivan for the patch to support C.UTF-8 in localeToCharset, I've
added it to R-devel.
On 1/31/22 14:08, Blätte, Andreas wrote:
> Dear Tomas,
>
> thanks a lot. I do understand the explanation of Simon - I was not aware of the standardization issue. My conclusion is that I should rely on another approach to detect the session charset, and your suggestions are my first option.
>
> My final thought: for users who do not know the POSIX standards and recent aberrations, a warning might be helpful, something such as:
> if (startsWith(locale, "C.")) warning(sprintf("%s is a non-standard locale", locale))

Dear Andreas, "C" and "POSIX" (and "") are the only two locales with
standard names (defined by POSIX), so people necessarily have to rely on
non-standard ones. When new ones are introduced, as in this case, we
need to update localeToCharset() to support them. Thanks for your
report.

Best
Tomas
>
> As far as I am concerned, I take away a lot from this discussion! Thank you!
>
> Kind regards
> Andreas
>
>
> On 31.01.22, 13:32, "Tomas Kalibera" <tomas.kalibera using gmail.com> wrote:
>
> Hi Andreas,
>
> is there still any higher-level problem left you need to solve? Ideally
> one wouldn't need to query what is the native encoding, but directly use
> iconv() or indirectly other R functions to convert the data from/to the
> native encoding. iconv() will find out internally what is the native
> encoding (via data that is available also by l10n_info(), but with care
> for differences between OSes).
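
A minimal sketch of the conversion-first approach described above, assuming the input bytes are known to be Latin-1. The sample string and variable names are illustrative and not taken from the thread. Passing `""` as `from` or `to` tells iconv() to use the native encoding, so the session charset never has to be queried by name.

```r
# Build the Latin-1 bytes for "facade" with a c-cedilla (0xe7) explicitly,
# so the example does not depend on how this file itself is encoded.
x <- rawToChar(as.raw(c(0x66, 0x61, 0xe7, 0x61, 0x64, 0x65)))

# Convert from a known source encoding to UTF-8.
utf8 <- iconv(x, from = "latin1", to = "UTF-8")

# Convert to the native encoding without naming it: to = "" means
# "whatever the current session uses". The result is NA if the native
# encoding cannot represent the text (for example in a plain "C" locale).
native <- iconv(utf8, from = "UTF-8", to = "")
```

The design point is that only the encoding of the data has to be known; the native side is resolved by iconv() itself.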
>
> Best
> Tomas
>
> On 1/31/22 12:38, Blätte, Andreas wrote:
> > Dear Ivan,
> >
> > this is a very helpful explanation! I think it is important to make the output of localeToCharset() more predictable. My problem is not really how to set the locale so that things work; the problem is that you get unexpected results. I guess I owe a suggestion on how to improve the code, but your suggestion looks like a very good starting point.
> >
> > Andreas
> >
> > On 31.01.22, 12:32, "Ivan Krylov" <krylov.r00t using gmail.com> wrote:
> >
> > On Mon, 31 Jan 2022 09:56:27 +0000
> > "Blätte, Andreas" <andreas.blaette using uni-due.de> wrote:
> >
> > > > After starting R with a re-defined locale (`env LC_CTYPE=en_US.UTF-8
> > > > R`), the output of `localeToCharset()` is:
> > > [1] "UTF-8" "ISO8859-1"
> >
> > > why ISO8859-1 might be a fallback option here?
> >
> > ISO8859-1 seems to be offered because it covers the alphabet of
> > American English. Obviously, this doesn't guarantee that the guess is
> > correct. For example, I could symlink the ru_RU.KOI8-R locale on my
> > system to name it "ru_RU", and localeToCharset() would return
> > "ISO8859-5", not knowing the correct answer. їЯавЯг, anyone?
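
Because the name-based result is only a guess, it can help to put it next to what the session itself reports. The following is a rough sketch, not a recommendation from the thread. It assumes that the `codeset` element of l10n_info() may be missing or empty, depending on the R version and platform, and therefore treats it as optional.

```r
# What the session reports about its own encoding.
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])   # TRUE in a UTF-8 session
os_codeset <- info$codeset           # may be NULL or "" (assumed optional)

# What localeToCharset() derives from the locale *name* alone.
ctype <- Sys.getlocale("LC_CTYPE")
name_guess <- utils::localeToCharset(ctype)

# Print both side by side; a mismatch suggests the name is misleading.
cat("LC_CTYPE:        ", ctype, "\n")
cat("name-based guess:", paste(name_guess, collapse = ", "), "\n")
cat("UTF-8 session:   ", is_utf8, "\n")
if (!is.null(os_codeset) && nzchar(os_codeset))
    cat("OS codeset:      ", os_codeset, "\n")
```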
> >
> > > Part of my analysis of the code of `localeToCharset()` is that it
> > > targets special scenarios on Windows and macOS, but not on Linux.
> >
> > Well, it almost does the right thing. GNU/Linux locales are typically
> > named like <language>_<country>.<encoding>, and localeToCharset()
> > respects the <encoding> part, but only if the language and the country
> > are specified. A quick fix for that would be to add one final case:
> >
> > Index: src/library/utils/R/iconv.R
> > ===================================================================
> > --- src/library/utils/R/iconv.R (revision 81596)
> > +++ src/library/utils/R/iconv.R (working copy)
> > @@ -135,6 +135,7 @@
> > if(enc == "utf8") return(c("UTF-8", guess(ll)))
> > else return(guess(ll))
> > }
> > + if (enc == "utf8") return("UTF-8") # fallback for ???.UTF-8
> > return(NA_character_)
> > }
> > }
> >
> > (Non-UTF-8 encodings on POSIX are handled above, in the if(nzchar(enc)
> > && enc != "utf8") branch.)
> >
> > Maybe a better fix would be to restructure the code a bit, to always
> > take the encoding hint and then also try to guess if the locale looks
> > like it provides a language code.
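
The restructuring idea in the last paragraph could look roughly like the sketch below. It is not the code in src/library/utils/R/iconv.R: the function names `locale_to_charset_sketch` and `guess_from_language` are hypothetical, and the two-entry language table only contains the examples mentioned in this thread.

```r
# Hypothetical sketch; names and the two-entry table are illustrative only.
guess_from_language <- function(lang) {
    known <- c(en = "ISO8859-1", ru = "ISO8859-5")  # examples from this thread
    if (lang %in% names(known)) unname(known[lang]) else character(0)
}

locale_to_charset_sketch <- function(locale) {
    if (!nzchar(locale)) return(NA_character_)
    locale <- sub("@.*$", "", locale)               # drop "@modifier", if any
    parts <- strsplit(locale, ".", fixed = TRUE)[[1]]
    name <- parts[1]
    enc <- if (length(parts) > 1) parts[2] else ""

    # 1. Always honour the encoding hint when the name carries one.
    hint <- character(0)
    if (nzchar(enc)) {
        norm <- tolower(gsub("-", "", enc, fixed = TRUE))
        hint <- if (norm == "utf8") "UTF-8" else enc
    }

    # 2. Guess only if the name starts with something like a language code.
    lang <- strsplit(name, "_", fixed = TRUE)[[1]][1]
    guess <- if (grepl("^[a-z]{2,3}$", lang)) guess_from_language(lang)
             else character(0)

    out <- unique(c(hint, guess))
    if (length(out)) out else NA_character_
}

locale_to_charset_sketch("C.UTF-8")      # encoding hint only, no language
locale_to_charset_sketch("en_US.UTF-8")  # hint first, then the language guess
locale_to_charset_sketch("ru_RU")        # no hint, language guess only
```

With this ordering, a name such as "C.UTF-8" no longer falls through to NA, since the encoding part is used even when no language or country is present.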
> >
> > --
> > Best regards,
> > Ivan
> >