[Rd] why does [A-Z] include 'T' in an Estonian locale?
Tomas Kalibera
tom@@@k@||ber@ @end|ng |rom gm@||@com
Thu Jun 1 11:53:19 CEST 2023
On 5/30/23 17:45, Ben Bolker wrote:
> Inspired by this old Stack Overflow question
>
> https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions
>
>
> I was wondering why this is TRUE:
>
> Sys.setlocale("LC_ALL", "et_EE")
> grepl("[A-Z]", "T")
>
> TRE's documentation at
> <https://laurikari.net/tre/documentation/regex-syntax/> says that a
> range "is shorthand for the full range of characters between those two
> [endpoints] (inclusive) in the collating sequence".
>
> Yet, T is *not* between A and Z in the Estonian collating sequence:
>
> sort(LETTERS)
>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
> [20] "Z" "T" "U" "V" "W" "X" "Y"
>
> I realize that this may be a question about TRE rather than about R
> *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so
> the question also applies to PCRE), but I'm wondering if anyone has
> any insights ... (and yes, I know that the correct answer is "use
> [:alpha:] and don't worry about it")

The correct answer depends on what you want to do, but please see
?regexp in R:

"Because their interpretation is locale- and implementation-dependent,
character ranges are best avoided."

and

"The only portable way to specify all ASCII letters is to list them all
as the character class
‘[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]’."
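
As a small illustration of the ?regexp advice quoted above, the explicit
class can be built from R's built-in LETTERS and letters constants rather
than typed out by hand. This is only a sketch; the variable name
ascii_letters is made up for the example.

```r
# LETTERS and letters are R's built-in constants holding the 26 ASCII
# upper- and lower-case letters, so no locale-dependent range is involved.
ascii_letters <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")

ascii_letters
# "[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]"

# "\u00f5" is o with tilde: a letter of the Estonian alphabet, but not ASCII.
grepl(ascii_letters, c("T", "z", "5", "\u00f5"))
# TRUE TRUE FALSE FALSE
```

Because every member of the class is listed explicitly, the result should
not depend on the collation order of the current locale.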

This is from the POSIX specification:

"In the POSIX locale, a range expression represents the set of collating
elements that fall between two elements in the collation sequence,
inclusive. In other locales, a range expression has unspecified
behavior: strictly conforming applications shall not rely on whether the
range expression is valid, or on the set of collating elements matched.
A range expression shall be expressed as the starting point and the
ending point separated by a <hyphen-minus> ( '-' )."
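
Since the behavior is unspecified outside the POSIX locale, one way to see
what a given setup actually does is to test a range against a set of
candidate characters. The following is a rough sketch; matched_by is a
hypothetical helper written for this example, not part of R.

```r
# Hypothetical helper: return the candidate characters that `pattern` matches.
# Extra arguments (e.g. perl = TRUE) are passed on to grepl().
matched_by <- function(pattern, chars = c(LETTERS, letters), ...) {
  chars[grepl(pattern, chars, ...)]
}

# The results depend on the locale and the regex engine in use, so no
# expected output is shown here.
matched_by("[A-Z]")               # default engine (TRE)
matched_by("[A-Z]", perl = TRUE)  # PCRE
Sys.getlocale("LC_COLLATE")       # the collation locale currently in effect
```

Running this under different locales shows whether a range is treated the
same way everywhere on a particular system, but it says nothing about other
platforms or R versions.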

If you really want to know why the current implementations of R, TRE and
PCRE2 behave the way they do, you can check the code, but I don't think
that would be a good use of time given what is written above.

It may be that TRE has a bug and does not do what was intended (see the
comment "XXX - Should use collation order instead of encoding values in
character ranges." in the code), but I didn't check the code thoroughly.
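
If a range is in fact evaluated on encoding values, as that source comment
suggests (an assumption here, not something verified against the TRE code),
the reported result would follow from the code points alone. A minimal
check, with cp as an illustrative variable name:

```r
# utf8ToInt() returns the Unicode code points of the characters in a string.
cp <- utf8ToInt("AZT")
cp
# 65 90 84

# By code point, "T" falls inside the closed interval from "A" to "Z",
# regardless of where the locale's collation order places it.
cp[3] >= cp[1] && cp[3] <= cp[2]
# TRUE
```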
Best
Tomas
>
> (In contrast, the ICU engine underlying stringi/stringr says "[t]he
> characters to include are determined by Unicode code point ordering" -
> see
>
> https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163
>
>
> for links)
>