[Rd] localeToCharset()
Blätte, Andreas
@ndre@@@b|@ette @end|ng |rom un|-due@de
Mon Jan 31 10:56:27 CET 2022
Dear all,
packages for processing text may need information on the charset of the R session. In my packages RcppCWB and polmineR, I extract this information from the locale using `localeToCharset()`. But when running cross-platform checks (GitHub Actions and Docker), I repeatedly encounter unexpected behavior of `localeToCharset()`.
As a reproducible example, I suggest using a local Fedora (latest) container, started as follows:
docker pull fedora:latest
docker run -it fedora:latest /bin/bash
After installing R (`yum install -y R`) and starting R, `localeToCharset()` returns `NA`. However, the locale part of the `sessionInfo()` output is as follows:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
If I run `R CMD check` on an arbitrary package in this environment at this stage, I see:
* using session charset: UTF-8
However, the documentation says: “In the C locale the answer will be "ASCII".” Why not UTF-8 in this case?
The `localeToCharset()` function also confuses me when I explicitly redefine the locale. In my fresh Fedora Docker container, I need to install English-language locales first:
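To make the comparison concrete, here is a minimal sketch that prints the different views base R offers of the session encoding. It uses only `Sys.getlocale()`, `l10n_info()` and `localeToCharset()`; the results depend on the platform and locale, so no expected output is shown.

```r
# Compare what base R reports about the encoding of the current session.
ctype <- Sys.getlocale("LC_CTYPE")      # the LC_CTYPE string, e.g. "C.UTF-8"
info  <- l10n_info()                    # list with flags MBCS, UTF-8, Latin-1
guess <- utils::localeToCharset(ctype)  # charset derived from the locale name

print(ctype)
print(info[["UTF-8"]])  # TRUE if R considers the session to be UTF-8
print(guess)            # may be NA although the session is UTF-8
```

If `info[["UTF-8"]]` is `TRUE` while `guess` is `NA`, the two functions disagree in the way described above.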
dnf install langpacks-en
After starting R with a redefined locale (`env LC_CTYPE=en_US.UTF-8 R`), the output of `localeToCharset()` is:
[1] "UTF-8" "ISO8859-1"
The “Value” section of the documentation says: “A character vector naming an encoding and possibly a fallback single-encoding, NA if unknown.” But I do not understand why ISO8859-1 would be a fallback option here.
I do not know whether this is just a matter of documentation. My intuition is that `localeToCharset()` should work differently. At the moment, I need to rely on a few workarounds to cope with behavior I do not understand. (Or is there a better function to detect the encoding of the R session?)
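As an illustration of the kind of workaround in question, here is a minimal sketch. The helper name `session_charset()` is hypothetical, it is not the code used in RcppCWB or polmineR, and treating an unknown charset as "ASCII" is an assumption made only for this example.

```r
# Hypothetical helper, not part of base R: consult l10n_info() first and use
# localeToCharset() only if the session is neither UTF-8 nor Latin-1.
session_charset <- function() {
  info <- l10n_info()
  if (isTRUE(info[["UTF-8"]])) return("UTF-8")
  if (isTRUE(info[["Latin-1"]])) return("ISO8859-1")
  guess <- utils::localeToCharset()[1L]  # first element is the primary guess
  # Assumption for this sketch: treat an unknown charset as plain ASCII.
  if (is.na(guess)) "ASCII" else guess
}

print(session_charset())
```

In the Fedora container described above, such a helper would rely on the `UTF-8` flag of `l10n_info()` and would not depend on how `localeToCharset()` parses the locale name `C.UTF-8`.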
Part of my analysis of the code of `localeToCharset()` is that it targets special scenarios on Windows and macOS, but not on Linux.
Kind regards
Andreas
--
Prof. Dr. Andreas Blaette
Professor of Public Policy and Regional Politics
University of Duisburg-Essen