[Rd] why does [A-Z] include 'T' in an Estonian locale?
Ben Bolker
bbo|ker @end|ng |rom gm@||@com
Sat Jun 3 17:34:20 CEST 2023
Thanks, I do know about the docs you quoted. Thanks for pointing me
to the comment in the code.
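For anyone following along: that comment suggests ranges are currently resolved by encoding value rather than by collation order. A minimal base R sketch of that comparison follows. The helper `in_codepoint_range` is hypothetical (my own name, not part of R or TRE), and this only illustrates the idea, not TRE's actual code.

```r
# Hypothetical helper: is a single character inside [lo-hi] when the
# range is resolved by code point rather than by collation order?
in_codepoint_range <- function(ch, lo = "A", hi = "Z") {
  cp <- utf8ToInt(ch)
  cp >= utf8ToInt(lo) && cp <= utf8ToInt(hi)
}

in_codepoint_range("T")   # TRUE: U+0054 lies between U+0041 and U+005A
in_codepoint_range("t")   # FALSE: U+0074 is above U+005A
```

Because the comparison uses code points only, the result does not depend on the current locale, which would be consistent with "T" matching in et_EE.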
I've posted an issue (a request to make the documentation match the
code) at the TRE repository:
https://github.com/laurikari/tre/issues/88
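In the meantime, a check that does not rely on range semantics can follow the advice in ?regexp and spell the class out. A minimal sketch, assuming only upper-case ASCII letters are wanted (`upper_class` is my own variable name; `LETTERS` is the base R constant):

```r
# Build the explicit class from the built-in LETTERS constant, so the
# pattern does not depend on how the engine interprets a range.
upper_class <- paste0("[", paste(LETTERS, collapse = ""), "]")

grepl(upper_class, "T")   # TRUE
grepl(upper_class, "t")   # FALSE
```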
On 2023-06-01 5:53 a.m., Tomas Kalibera wrote:
>
> On 5/30/23 17:45, Ben Bolker wrote:
>> Inspired by this old Stack Overflow question
>>
>> https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions
>>
>> I was wondering why this is TRUE:
>>
>> Sys.setlocale("LC_ALL", "et_EE")
>> grepl("[A-Z]", "T")
>>
>> TRE's documentation at
>> <https://laurikari.net/tre/documentation/regex-syntax/> says that a
>> range "is shorthand for the full range of characters between those two
>> [endpoints] (inclusive) in the collating sequence".
>>
>> Yet, T is *not* between A and Z in the Estonian collating sequence:
>>
>> sort(LETTERS)
>> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
>> [20] "Z" "T" "U" "V" "W" "X" "Y"
>>
>> I realize that this may be a question about TRE rather than about R
>> *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so
>> the question also applies to PCRE), but I'm wondering if anyone has
>> any insights ... (and yes, I know that the correct answer is "use
>> [:alpha:] and don't worry about it")
>
> The correct answer depends on what you want to do, but please see
> ?regexp in R:
>
> "Because their interpretation is locale- and implementation-dependent,
> character ranges are best avoided."
>
> and
>
> "The only portable way to specify all ASCII letters is to list them all
> as the character class
> ‘[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]’."
>
> This is from the POSIX specification:
>
> "In the POSIX locale, a range expression represents the set of collating
> elements that fall between two elements in the collation sequence,
> inclusive. In other locales, a range expression has unspecified
> behavior: strictly conforming applications shall not rely on whether the
> range expression is valid, or on the set of collating elements matched.
> A range expression shall be expressed as the starting point and the
> ending point separated by a <hyphen-minus> ( '-' )."
>
> If you really want to know why the current implementation of R, TRE and
> PCRE2 works in a certain way, you can check the code, but I don't think
> it would be a good use of time given what is written above.
>
> It may be that TRE has a bug, or that it doesn't do what was intended
> (see the comment "XXX - Should use collation order instead of encoding
> values in character ranges." in the code), but I didn't check the code
> thoroughly.
>
> Best
> Tomas
>
>>
>> (In contrast, the ICU engine underlying stringi/stringr says "[t]he
>> characters to include are determined by Unicode code point ordering" -
>> see
>>
>> https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163
>>
>> for links)
>>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel