[Rd] A few problems with Sys.setLanguage()

Ivan Krylov
Tue Feb 11 14:59:27 CET 2025


Hello R-devel,

Currently, Sys.setLanguage() treats an empty or absent LANGUAGE
environment variable as meaning unset = "en". This disagrees with
gettext(), which in that case defaults to the LC_MESSAGES category of
the current locale [1]. As a result, on systems where $LANGUAGE is
normally unset, Sys.setLanguage(...) returns "en" instead of the
language previously in effect. I would like to suggest making the
default unset = Sys.getlocale("LC_MESSAGES") instead of "en", so that
Sys.setLanguage(Sys.setLanguage(anything)) would not reset the language
to English. Making Sys.setLanguage() accept an empty string or NA to
reset or remove LANGUAGE (and allowing Sys.setLanguage() to return that
value) could also be an option.
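
Until something like that exists, the round-trip problem can be worked
around by saving and restoring the environment variable directly. A
minimal sketch, assuming the hypothetical helper names save_language()
and restore_language() (neither is part of R). As far as I can tell,
Sys.setLanguage() also flushes the cache of already translated
messages, which this sketch does not attempt:

```r
# Hypothetical helpers, not part of base R.

# Remember LANGUAGE exactly as it is; NA means "not set at all".
save_language <- function() {
  Sys.getenv("LANGUAGE", unset = NA_character_)
}

# Put LANGUAGE back, removing it again if it was absent before.
restore_language <- function(old) {
  if (is.na(old)) {
    Sys.unsetenv("LANGUAGE")
  } else {
    Sys.setenv(LANGUAGE = old)
  }
  invisible(old)
}
```

Usage would be old <- save_language(), then Sys.setLanguage("de"), and
finally restore_language(old) instead of Sys.setLanguage(old).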

Additionally, there are a number of problems with the way
Sys.setLanguage() handles R having started up in the C locale, some of
them easier to solve than others.

gettext() disables translation lookup only when the LC_MESSAGES locale
category is "C" or "POSIX", so the current test for identical("C",
Sys.getlocale()) will miss the situations where LC_MESSAGES is "C" but
some other locale categories are not. I think the correct test should be
Sys.getlocale("LC_MESSAGES") %in% c("C", "POSIX", "C.UTF-8", "C.utf8").
(On my GNU/Linux system, setting a "POSIX" locale returns it as "C",
but I don't think that's guaranteed to happen everywhere.)
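
Wrapped into a function, the suggested test could look like the sketch
below. The name lc_messages_disables_translation() is made up for
illustration, and taking the locale name as an argument is only there
so that the check can be tried without changing the session locale:

```r
# Hypothetical helper, not part of base R.
# TRUE if the given LC_MESSAGES value is one of the locale names that
# make gettext() skip translation lookup.
lc_messages_disables_translation <- function(
  lc_messages = Sys.getlocale("LC_MESSAGES")
) {
  lc_messages %in% c("C", "POSIX", "C.UTF-8", "C.utf8")
}
```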

So what should Sys.setLanguage(lang, force=TRUE) do when the current
LC_MESSAGES locale category disables translation? "en_US.UTF-8" is not
guaranteed to be present on a given system. POSIX documents 'locale -a'
to list available locales [2], so R could attempt something like:

# any locales except C.*/POSIX which disable translation?
candidates <- tryCatch(
 system("locale -a", intern = TRUE),
 error = function(e) character() # 'locale' could not be run at all
) |> setdiff(c("C", "C.UTF-8", "C.utf8", "POSIX"))
locale <- if (any(mask <- startsWith(candidates, lang))) {
 candidates[mask][[1]]
} else if (length(candidates)) {
 candidates[[1]]
} else {
 "en_US.UTF-8" # maybe it's available despite 'locale -a' failing?
}
lcSet <- Sys.setlocale("LC_MESSAGES", locale)

Unfortunately, that's not all: translations are also affected by the
LC_CTYPE category of the current locale, and gettext() will try to
convert the translations into that locale's encoding before returning
them. What about LC_CTYPE being "C"? Sometimes gettext() is able to
transliterate:

$ LC_CTYPE=C LANGUAGE=ru R -q -s -e 'foo'
Oshibka: ob``ekt 'foo' ne najden
Vy`polnenie ostanovleno

And sometimes it's not:

$ LC_CTYPE=C LANGUAGE=zh_CN R -q -s -e 'foo'
??: ?????'foo'
???? # <-- these are \x3F question marks, not replacement characters

There doesn't seem to be a portable way to determine a locale with an
encoding that would be appropriate in the current session. For example,
on my system, only 4 locales out of 11 listed by 'locale -a' use UTF-8
as their encoding (and sometimes UTF-8 is the wrong choice when I'm
using 'luit' with a non-UTF-8 environment).
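
What can be checked, at least, is whether a given message survives
conversion into a given encoding. A rough sketch, assuming a
hypothetical helper convertible_to() built on iconv(). It is stricter
than what gettext() does, since it does not allow transliteration, and
it says nothing about what the terminal can actually display:

```r
# Hypothetical helper, not part of base R.
# TRUE for each element of x (assumed to be UTF-8) that iconv() can
# convert to the target encoding without loss; to = "" stands for the
# encoding of the current locale.
convertible_to <- function(x, to = "") {
  !is.na(iconv(x, from = "UTF-8", to = to))
}
```

For example, convertible_to(msg, "ASCII") roughly answers whether msg
could be shown untransliterated with LC_CTYPE set to "C".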

R could try to force the same locale for LC_CTYPE as it sets
LC_MESSAGES, or force a UTF-8 locale if it finds one, or leave LC_CTYPE
as it is. All of these options have their downsides. How helpful is
Sys.setLanguage(force = TRUE) in practice? 

-- 
Best regards,
Ivan

[1] The environment variables used for gettext() are listed at the
following resources:
https://www.gnu.org/software/gettext/manual/html_node/Locale-Environment-Variables.html
https://pubs.opengroup.org/onlinepubs/9799919799/basedefs/V1_chap08.html#tag_08_02
The exact lookup procedure is also documented here:
https://pubs.opengroup.org/onlinepubs/9799919799/functions/dngettext.html
In short, if the LC_MESSAGES category of the current locale is
"C" or "POSIX", gettext() does not translate. (GNU gettext additionally
disables translation for "C.UTF-8".) Otherwise it consults the LANGUAGE
environment variable. If that variable is absent or empty, it uses the
LC_MESSAGES category of the current locale. When a program calls
setlocale(category, ""), $LANG provides the default value for all
categories, which is overridden by the $LC_* variables for individual
categories, which are all overridden by $LC_ALL.
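
The order of precedence described above can be sketched in R as
follows. Both function names are made up, the code only mimics the
description and is not how gettext() or setlocale() are implemented,
and falling back to "C" when none of the variables is set is my
assumption about the typical default:

```r
# Hypothetical helpers that mimic the lookup order described in [1].

# What setlocale(LC_MESSAGES, "") would pick, given a named character
# vector of environment variables: LC_ALL, then LC_MESSAGES, then LANG.
resolve_lc_messages <- function(env) {
  for (name in c("LC_ALL", "LC_MESSAGES", "LANG")) {
    value <- env[name]
    if (!is.na(value) && nzchar(value)) return(unname(value))
  }
  "C" # assumption: the default when nothing is set
}

# Which language setting message lookup would use; NA means
# "no translation".
effective_message_language <- function(lc_messages, language = "") {
  if (lc_messages %in% c("C", "POSIX", "C.UTF-8", "C.utf8")) {
    return(NA_character_)
  }
  if (nzchar(language)) language else lc_messages
}
```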

[2]
https://pubs.opengroup.org/onlinepubs/9799919799/utilities/locale.html
