icuSetCollate {base} | R Documentation |
Controls the way collation is done by ICU (an optional part of the R build).
icuSetCollate(...)
icuGetCollate(type = c("actual", "valid"))
...
named arguments, see ‘Details’.
type
a character string: either "actual" or "valid", as in the usage above.
Optionally, R can be built to collate character strings by ICU
(https://icu.unicode.org/). For such systems, icuSetCollate can be used
to tune the way collation is done. On other builds, calling this
function does nothing except give a warning.
Possible arguments are
locale
:A character string such as "da_DK" giving the language and country
whose collation rules are to be used. If present, this should be the
first argument.
case_first
:"upper", "lower" or "default", asking for upper- or lower-case
characters to be sorted first. The default is usually lower-case first,
but not in all languages (not under the default settings for Danish,
for example).
alternate_handling
:Controls the handling of ‘variable’ characters (mainly punctuation
and symbols). Possible values are "non_ignorable" (primary strength)
and "shifted" (quaternary strength).
strength
:Which components should be used? Possible values are "primary",
"secondary", "tertiary" (default), "quaternary" and "identical".
french_collation
:In a French locale the way accents affect collation is from right to
left, whereas in most other locales it is from left to right. Possible
values are "on", "off" and "default".
normalization
:Should strings be normalized? Possible values are "on" and "off"
(default). This affects the collation of composite characters.
case_level
:An additional level between secondary and tertiary, used to
distinguish large and small Japanese Kana characters. Possible values
are "on" and "off" (default).
hiragana_quaternary
:Possible values are "on" (sort Hiragana first at quaternary level)
and "off".
Only the first three are likely to be of interest except to those with a detailed understanding of collation and specialized requirements.
Some special values are accepted for locale:
"none"
:ICU is not used for collation: the OS's collation services are used instead.
"ASCII"
:ICU is not used for collation: the C function strcmp is used instead, which should sort byte-by-byte in (unsigned) numerical order.
"default"
:obtains the locale from the OS as is done at the start of the session (except on Windows). If environment variable R_ICU_LOCALE is set to a non-empty value, its value is used rather than consulting the OS, unless environment variable LC_ALL is set to 'C' (or unset but LC_COLLATE is set to 'C').
"", "root"
:the ‘root’ collation: see https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation.
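The special values can be compared side by side. The following is a minimal sketch, assuming an ICU-enabled build of R; the vector x is made up for illustration, and the ordering under "root" may depend on the ICU version in use.

```r
## Illustrative data (not taken from the documentation above).
x <- c("b", "A", "a", "B")

if (capabilities("ICU")) {
    ## "ASCII": strcmp is used, so bytes sort in numerical order and
    ## upper-case ASCII letters come before lower-case ones.
    icuSetCollate(locale = "ASCII")
    ascii_order <- sort(x)

    ## "root": the ICU root collation.
    icuSetCollate(locale = "root")
    root_order <- sort(x)

    ## "default": go back to the locale obtained from the OS.
    icuSetCollate(locale = "default")

    print(ascii_order)
    print(root_order)
} else {
    message("ICU is not available in this build of R")
}
```

The final call with locale = "default" is there so that the sketch does not leave the session in a non-default collation state.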
For the specifications of ‘real’ ICU locales, see
https://unicode-org.github.io/icu/userguide/locale/. Note that ICU does
not report that a locale is not supported, but falls back to its idea
of ‘best fit’ (which could be rather different and is reported by
icuGetCollate("actual"), often "root"). Most English locales fall back
to "root": although e.g. "en_GB" is a valid locale (at least on some
platforms), it contains no special rules for collation. Note that "C"
is not a supported ICU locale and hence R_ICU_LOCALE should never be
set to "C".
Some examples are case_level = "on", strength = "primary" to ignore
accent differences and alternate_handling = "shifted" to ignore space
and punctuation characters.
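These two settings might be tried out as follows. This is only a sketch, assuming an ICU-enabled build; the strings are invented for illustration, the "root" locale is chosen arbitrarily so that the result does not depend on the session locale, and no particular output is claimed.

```r
if (capabilities("ICU")) {
    ## Ignore accent differences (case is still distinguished).
    icuSetCollate(locale = "root", case_level = "on", strength = "primary")
    plain <- "resume"
    accented <- "r\u00e9sum\u00e9"   # escapes keep the source ASCII-only
    ## If accents are ignored, neither string sorts before the other.
    accents_ignored <- !(plain < accented) && !(plain > accented)
    print(accents_ignored)

    ## Ignore space and punctuation characters. Naming the locale again
    ## resets the customizations made above.
    icuSetCollate(locale = "root", alternate_handling = "shifted")
    words <- c("co-op", "coat", "co op", "coop")
    shifted_order <- sort(words)
    print(shifted_order)

    ## Return to the locale obtained from the OS.
    icuSetCollate(locale = "default")
}
```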
Initially ICU will not be used for collation if the OS is set to use
the C locale for collation and R_ICU_LOCALE is not set. Once this
function is called with a value for locale, ICU will be used until it
is called again with locale = "none". ICU will not be used once
Sys.setlocale is called with a "C" value for LC_ALL or LC_COLLATE, even
if R_ICU_LOCALE is set. ICU will be used again honoring R_ICU_LOCALE
once Sys.setlocale is called to set a different collation order.
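The switching described above can be observed with icuGetCollate. The sketch below assumes an ICU-enabled build and that the saved LC_COLLATE value can be restored on the platform at hand; the strings reported at each step are not shown because they vary between systems.

```r
if (capabilities("ICU")) {
    ## Remember the current OS collation category so it can be restored.
    old <- Sys.getlocale("LC_COLLATE")

    ## Setting the OS collation category to "C" stops ICU being used.
    invisible(Sys.setlocale("LC_COLLATE", "C"))
    after_c <- icuGetCollate()
    print(after_c)

    ## Naming a locale explicitly brings ICU back into use ...
    icuSetCollate(locale = "root")
    after_root <- icuGetCollate()
    print(after_root)

    ## ... until locale = "none" hands collation back to the OS.
    icuSetCollate(locale = "none")
    after_none <- icuGetCollate()
    print(after_none)

    ## Restore the collation category saved at the start.
    invisible(Sys.setlocale("LC_COLLATE", old))
}
```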
Environment variables LC_ALL (or LC_COLLATE) take precedence
over R_ICU_LOCALE if and only if they are set to 'C'. Due to the
interaction with other ways of setting the collation order,
R_ICU_LOCALE should be used with care and only when needed.
All customizations are reset to the default for the locale if locale is
specified: the collation engine is reset if the OS collation locale
category is changed by Sys.setlocale.
For icuGetCollate, a character string describing the ICU locale in use
(which may be reported as "ICU not in use"). The ‘actual’ locale may be
simpler than the requested locale: for example "da" rather than
"da_DK". English locales are likely to report "root".
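To see the difference between the two types, one can request a specific locale and query both. A minimal sketch, assuming an ICU-enabled build whose data include Danish; the values returned depend on the ICU installation.

```r
if (capabilities("ICU")) {
    icuSetCollate(locale = "da_DK")

    ## The locale whose collation rules ICU actually applies.
    actual <- icuGetCollate("actual")
    ## The other type accepted by icuGetCollate.
    valid <- icuGetCollate("valid")
    print(c(actual = actual, valid = valid))

    ## Return to the locale obtained from the OS.
    icuSetCollate(locale = "default")
}
```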
Except on Windows, ICU is used by default wherever it is available. As it works internally in UTF-8, it will be most efficient in UTF-8 locales.
On Windows, R is normally built including ICU, but it will only be used
if environment variable R_ICU_LOCALE was set when R was started, or
after icuSetCollate is called to select the locale (as ICU and Windows
differ in their idea of locale names). Note that
icuSetCollate(locale = "default") should work reasonably well, but
finds the system default ignoring environment variables such as
LC_COLLATE.
capabilities for whether ICU is available; extSoftVersion for its version.
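Both checks can be made before relying on ICU collation. A short sketch; the variable names are illustrative only.

```r
## TRUE if this build of R can use ICU for collation.
has_icu <- capabilities("ICU")

## Version of ICU in use; an empty string if there is none.
icu_version <- extSoftVersion()[["ICU"]]

cat("ICU available:", has_icu, "\n")
cat("ICU version:",
    if (nzchar(icu_version)) icu_version else "(none reported)", "\n")
```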
The ICU user guide chapter on collation (https://unicode-org.github.io/icu/userguide/collation/).
## These examples depend on having ICU available, and on the locale.
## As we don't know the current settings, we can only reset to the default.
if(capabilities("ICU")) withAutoprint({
icuGetCollate()
icuGetCollate("valid")
x <- c("Aarhus", "aarhus", "safe", "test", "Zoo")
sort(x)
icuSetCollate(case_first = "upper"); sort(x)
icuSetCollate(case_first = "lower"); sort(x)
## Danish collates upper-case-first and with 'aa' as a single letter
icuSetCollate(locale = "da_DK", case_first = "default"); sort(x)
## Estonian collates Z between S and T
icuSetCollate(locale = "et_EE"); sort(x)
icuSetCollate(locale = "default"); icuGetCollate("valid")
})