[R] Symbol/String comparison in R

Rui Barradas ruipbarradas using sapo.pt
Fri Apr 15 05:26:59 CEST 2022


Hello,

Inline.

At 22:09 on 14/04/2022, Kristjan Kure wrote:
> Thank you, Rui. Not sure I got everything right, but here it is:
> 
> current_loc <- Sys.getlocale("LC_COLLATE")
> #> [1] "Estonian_Estonia.1257"
> 
> "A" < "a"
> #41 < 61
> #> [1] FALSE
> raw_A <- charToRaw("A") #41
> raw_a <- charToRaw("a") #61
> # Not OK - should be TRUE (41 is less than 61)
> 
> "A" > "a"
> #41 > 61
> #> [1] TRUE
> raw_A <- charToRaw("A") #41
> raw_a <- charToRaw("a") #61
> # Not OK - should be FALSE (41 is not bigger than 61)
> 
> Sys.setlocale("LC_COLLATE", locale = "C")
> 
> "A" < "a"
> #41 < 61
> #> [1] TRUE
> raw_A <- charToRaw("A") #41
> raw_a <- charToRaw("a") #61
> 
> # OK - (41 is less than 61)
> 
> "A" > "a"
> #41 > 61
> #> [1] FALSE
> raw_A <- charToRaw("A") #41
> raw_a <- charToRaw("a") #61
> 
> # OK - (41 is not bigger than 61)
> 
> Sys.setlocale("LC_COLLATE", current_loc)
> 
> Conclusion: Comparing strings using charToRaw() only works correctly 
> with locale = C?

You are still confusing the locale with the ASCII codes (the raw values).
Windows code pages like 1252 or your 1257 are supersets of ASCII, and 
the ASCII hex codes are laid out systematically. Upper case and lower 
case letters are 2^5 == 32 == 0x20 apart, so setting the bit with value 
0x20 (bit 5, counting from 0) takes you from upper to lower case:

"A": 0100 0001 == 0x41
"a": 0110 0001 == 0x61

"B": 0100 0010
"b": 0110 0010

etc.
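
The bit pattern above can be checked directly in R. This is only a 
minimal sketch: toggle_case_ascii() is a hypothetical helper name, not a 
base R function, and it assumes a single ASCII letter (A-Z or a-z) as 
input.

```r
# Integer codes of "A" and "a" (decimal 65 and 97, i.e. 0x41 and 0x61)
upper <- utf8ToInt("A")
lower <- utf8ToInt("a")

# The two codes differ by exactly 0x20 == 32
lower - upper
#> [1] 32

# Hypothetical helper: flip the 0x20 bit of a single ASCII letter.
# Assumption: `ch` is one character in A-Z or a-z; anything else is
# outside what this sketch is meant to handle.
toggle_case_ascii <- function(ch) {
  intToUtf8(bitwXor(utf8ToInt(ch), 0x20L))
}

toggle_case_ascii("A")
#> [1] "a"
toggle_case_ascii("b")
#> [1] "B"
```

For real work toupper() and tolower() are the right tools; the point 
here is only that the codes follow a regular bit layout.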

This relates to human alphabets and languages only in that it's an 
attempt to make an electronic code usable to transmit/record/retrieve 
text in a human-readable way. But each language's lexicographic order 
need not follow this encoding's order, even if the encoding is what is 
used to record the text electronically.
In the examples below you'll see that changing the locale does not 
change the numeric codes.

Comparing strings using charToRaw() only works correctly if what you 
want is to compare codes, not letters (in the sense of human writing).
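
If comparing codes really is what you want, note that `<` on two raw 
vectors compares element by element, which is not a string comparison 
once the strings have more than one character. A minimal sketch of a 
code-based comparison, where bytes_less() is a hypothetical helper (not 
part of base R) and the inputs are assumed to be single strings:

```r
# Hypothetical helper: TRUE if string x comes before string y when the
# two are compared byte by byte, left to right (code order, not locale
# order). Assumption: x and y are length-one character vectors.
bytes_less <- function(x, y) {
  x_bytes <- as.integer(charToRaw(x))
  y_bytes <- as.integer(charToRaw(y))
  n <- min(length(x_bytes), length(y_bytes))

  # Positions, within the common length, where the two strings differ
  diff_pos <- which(x_bytes[seq_len(n)] != y_bytes[seq_len(n)])

  if (length(diff_pos) == 0L) {
    # One string is a prefix of the other: the shorter one comes first
    return(length(x_bytes) < length(y_bytes))
  }

  # The first differing byte decides the result
  i <- diff_pos[1L]
  x_bytes[i] < y_bytes[i]
}

bytes_less("A", "a")         # 0x41 vs 0x61
#> [1] TRUE
bytes_less("1040", "12000")  # decided at position 2: 0x30 vs 0x32
#> [1] TRUE
bytes_less("1040", "10000")  # decided at position 3: 0x34 vs 0x30
#> [1] FALSE
```

The result does not depend on LC_COLLATE, because no collation is 
involved. For plain ASCII strings I would expect it to agree with what 
`<` gives under the "C" locale.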


old_loc <- Sys.getlocale("LC_COLLATE")

# hexadecimal base integers
raw_A <- charToRaw("A") # 0x41
raw_a <- charToRaw("a") # 0x61

raw_A < raw_a
#> [1] TRUE
raw_A > raw_a
#> [1] FALSE

as.integer(raw_A)
#> [1] 65
as.integer(raw_a)
#> [1] 97

Sys.setlocale("LC_COLLATE", locale = "C")
#> [1] "C"

(C_raw_A <- charToRaw("A")) # 0x41
#> [1] 41
(C_raw_a <- charToRaw("a")) # 0x61
#> [1] 61
C_raw_A < C_raw_a
#> [1] TRUE
C_raw_A > C_raw_a
#> [1] FALSE

identical(raw_A, C_raw_A)
#> [1] TRUE
identical(raw_a, C_raw_a)
#> [1] TRUE

Sys.setlocale("LC_COLLATE", old_loc)
#> [1] "Portuguese_Portugal.1252"
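
The save/set/restore pattern used above can be wrapped in a small 
function so the old locale is restored even if the comparison fails. 
This is just a sketch; with_collate() is a hypothetical name, not an 
existing R function, and it assumes the requested locale name is valid 
on your system ("C" should be available everywhere).

```r
# Hypothetical helper: evaluate `expr` with LC_COLLATE temporarily set
# to `locale`, then restore the previous setting.
with_collate <- function(locale, expr) {
  old_loc <- Sys.getlocale("LC_COLLATE")
  # Restore the previous collation when the function exits, even on error
  on.exit(Sys.setlocale("LC_COLLATE", old_loc), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  # `expr` is a lazily evaluated argument; forcing it here means the
  # comparison runs under the new collation setting
  force(expr)
}

before <- Sys.getlocale("LC_COLLATE")

with_collate("C", "A" < "a")
#> [1] TRUE
with_collate("C", sort(c("b", "A", "a", "B")))
#> [1] "A" "B" "a" "b"

# The original setting is back in place afterwards
identical(Sys.getlocale("LC_COLLATE"), before)
#> [1] TRUE
```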


Hope this helps,

Rui Barradas


> Regards,
> Kristjan
> 
> On Thu, Apr 14, 2022 at 10:01 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> 
>     Hello,
> 
>     1) The best I could find on lower case/upper case is [1];
>     The Wikipedia page you link to is about a code page and the collating
>     sequence is the same as ASCII so no, that's not it.
> 
>     2) In the cp1252 table "A" < "a"; it follows the numeric order 0x41 <
>     0x61. But what R is using is the locale's LC_COLLATE setting, not the
>     "C" one.
> 
>     How to validate the end results? The best way is to check the current
>     setting, with Sys.getlocale.
> 
> 
> 
>     [1]
>     https://books.google.pt/books?id=GkajBQAAQBAJ&pg=PA259&lpg=PA259&dq=collating+sequence+portuguese&source=bl&ots=fVnUYHz0ev&sig=ACfU3U3xjpJfPNcWEfvwb_2nScYb89CeOw&hl=pt-PT&sa=X&ved=2ahUKEwiAoNTW-JP3AhVI1xoKHXT-C4oQ6AF6BAgUEAM#v=onepage&q=collating%20sequence%20portuguese&f=false
> 
> 
>     Hope this helps,
> 
>     Rui Barradas
> 
>     At 16:33 on 14/04/2022, Kristjan Kure wrote:
>      > Hi Rui
>      >
>      > Thank you for the code snippet.
>      >
>      > 1) How do you find your "Portuguese_Portugal.1252" symbols table now?
>      > Is it this https://en.wikipedia.org/wiki/Windows-1252 ?
>      >
>      > 2) What attributes and values do you check to validate the end result?
>      > I see there is a section "Codepage layout" and I can find "A" and "a"
>      > symbols.
>      >
>      > What values on that table tell you "A" is bigger than "a"?
>      > "A" < "a" # returns FALSE
>      > "A" > "a" # returns TRUE
>      >
>      > PS! My locale is Estonian_Estonia.1257
>      >
>      > Regards,
>      > Kristjan
>      >
>      > On Thu, Apr 14, 2022 at 5:05 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>      >
>      >     Hello,
>      >
>      >     This is a locale issue: you are counting on the ASCII table codes,
>      >     but that ordering is only used in the "C" locale.
>      >
>      >     old_loc <- Sys.getlocale("LC_COLLATE")
>      >
>      >     "A" < "a"
>      >     #> [1] FALSE
>      >     "A" > "a"
>      >     #> [1] TRUE
>      >
>      >     Sys.setlocale("LC_COLLATE", locale = "C")
>      >     #> [1] "C"
>      >
>      >     "A" < "a"
>      >     #> [1] TRUE
>      >     "A" > "a"
>      >     #> [1] FALSE
>      >
>      >     Sys.setlocale("LC_COLLATE", old_loc)
>      >     #> [1] "Portuguese_Portugal.1252"
>      >
>      >
>      >     Hope this helps,
>      >
>      >     Rui Barradas
>      >
>      >     At 15:06 on 13/04/2022, Kristjan Kure wrote:
>      >      > Hi!
>      >      >
>      >      > Sorry, I am a beginner in R.
>      >      >
>      >      > I was not able to find answers to my questions (tried Google,
>      >      > Stack Overflow, etc). Please correct me if anything is wrong here.
>      >      >
>      >      > When comparing symbols/strings in R, are raw numeric values
>      >      > compared symbol by symbol starting from the left? If raw numeric
>      >      > values are not used, is there an ASCII / Unicode table where
>      >      > symbols have values/ranking/order and R compares those values?
>      >      >
>      >      > *2) Comparing symbols*
>      >      > Letter "a" raw value is 61, letter "b" raw value is 62? Is this
>      >      > correct?
>      >      >
>      >      > # Raw value for "a" = 61
>      >      > a_raw <- charToRaw("a")
>      >      > a_raw
>      >      >
>      >      > # Raw value for "b" = 62
>      >      > b_raw <- charToRaw("b")
>      >      > b_raw
>      >      >
>      >      > # equals TRUE
>      >      > "a" < "b"
>      >      >
>      >      > Ok, so 61 is less than 62 so it's TRUE. Is this correct?
>      >      >
>      >      > *3) Comparing strings #1*
>      >      > "1040" <= "12000"
>      >      >
>      >      > raw_1040 <- charToRaw("1040")
>      >      > raw_1040
>      >      > #31 *30* (comparison happens with the second symbol) 34 30
>      >      >
>      >      > raw_12000 <- charToRaw("12000")
>      >      > raw_12000
>      >      > #31 *32* (comparison happens with the second symbol) 30 30 30
>      >      >
>      >      > The symbol in the second position is 30 and it's less than 32,
>      >      > so the result is TRUE. Is this correct?
>      >      >
>      >      > *4) Comparing strings #2*
>      >      > "1040" <= "10000"
>      >      >
>      >      > raw_1040 <- charToRaw("1040")
>      >      > raw_1040
>      >      > #31 30 *34*  (comparison happens with third symbol) 30
>      >      >
>      >      > raw_10000 <- charToRaw("10000")
>      >      > raw_10000
>      >      > #31 30 *30*  (comparison happens with third symbol) 30 30
>      >      >
>      >      > The symbol in the third position is 34, which is greater than 30,
>      >      > so the result is FALSE. Is this correct?
>      >      >
>      >      > *5) Problem - Why does this equal FALSE?*
>      >      > *"A" < "a"*
>      >      >
>      >      > 41 < 61 # FALSE?
>      >      >
>      >      > # Raw value for "A" = 41
>      >      > A_raw <- charToRaw("A")
>      >      > A_raw
>      >      >
>      >      > # Raw value for "a" = 61
>      >      > a_raw <- charToRaw("a")
>      >      > a_raw
>      >      >
>      >      > Why is capitalized "A" not less than lowercase "a"? Based on raw
>      >      > values it should be. What am I missing here?
>      >      >
>      >      > Thanks
>      >      > Kristjan
>      >      >
>      >
>


