[Rd] why does [A-Z] include 'T' in an Estonian locale?
Martin Maechler
m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Thu Jun 1 10:11:22 CEST 2023
>>>>> Ben Bolker
>>>>> on Tue, 30 May 2023 11:45:20 -0400 writes:
> Inspired by this old Stack Overflow question
> https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions
> I was wondering why this is TRUE:
> Sys.setlocale("LC_ALL", "et_EE")
> grepl("[A-Z]", "T")
> TRE's documentation at
> <https://laurikari.net/tre/documentation/regex-syntax/> says that a
> range "is shorthand for the full range of characters between those two
> [endpoints] (inclusive) in the collating sequence".
> Yet, T is *not* between A and Z in the Estonian collating sequence:
> sort(LETTERS)
> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
> [20] "Z" "T" "U" "V" "W" "X" "Y"
> I realize that this may be a question about TRE rather than about R
> *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so
> the question also applies to PCRE), but I'm wondering if anyone has any
> insights ... (and yes, I know that the correct answer is "use [:alpha:]
> and don't worry about it")
> (In contrast, the ICU engine underlying stringi/stringr says "[t]he
> characters to include are determined by Unicode code point ordering" - see
> https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163
> for links)
Your last (parenthesized) sentence may point to the solution of the riddle:
Nowadays, typically in R
> capabilities()[["ICU"]]
[1] TRUE
but of course one now has to investigate whether, and why, ICU seems
to take precedence over the locale's own "sort"ing ...
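A minimal sketch of how one might start checking this, using base R only
(the variable names are purely illustrative). It contrasts code point
order, which does not depend on the locale, with the locale-dependent
collation that sort() uses by default. It does not show which of the two
the regex engines consult; that is the open question.

```r
# Code point order is locale-independent: "T" (84) lies between
# "A" (65) and "Z" (90), whatever the collating sequence says.
cp_T <- utf8ToInt("T")
in_codepoint_range <- utf8ToInt("A") <= cp_T && cp_T <= utf8ToInt("Z")
print(in_codepoint_range)

# method = "radix" sorts character vectors in C-locale order,
# so this result is the same in every locale.
radix_sorted <- sort(LETTERS, method = "radix")
print(radix_sorted)

# The default sort() follows LC_COLLATE (via ICU where R uses it for
# collation), so its result can differ between locales such as et_EE.
print(capabilities("ICU"))
print(Sys.getlocale("LC_COLLATE"))
print(sort(LETTERS))
```

If grepl("[A-Z]", x) agrees with the code point test rather than with the
default sort() order for every letter, that would suggest the range is
resolved by code point and not by the collating sequence.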
Best regards,
Martin