[R] Symbol/String comparison in R

Richard O'Keefe r@oknz @end|ng |rom gm@||@com
Thu Apr 14 13:25:17 CEST 2022


To the original poster: don't even think about
charToRaw.  For one thing, it prints bytes in
hexadecimal.  The integer code that corresponds
to "a" can be found thus:
> library(gtools)
> asc("a")
97
and the answer is (predictably) 97; the 61 you saw
is the same number written in hexadecimal.
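
If you would rather not load a package for this, base R can
do the same job.  A small sketch (the variable names are
mine, not from the original question):

```r
# charToRaw() returns raw bytes, and raw bytes print in hexadecimal.
a_raw <- charToRaw("a")
print(a_raw)              # [1] 61   (hexadecimal 0x61)

# The same byte as an ordinary decimal integer.
a_int <- as.integer(a_raw)
print(a_int)              # [1] 97

# utf8ToInt() gives the Unicode code point directly.
a_code <- utf8ToInt("a")
print(a_code)             # [1] 97
```

So 61 and 97 are one value in two notations, and neither
of them is what "<" looks at in most locales.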

> ?"<"
...
     Comparison of strings in character vectors is lexicographic within
     the strings using the collating sequence of the locale in use: see
     'locales'.  The collating sequence of locales such as 'en_US' is
     normally different from 'C' (which should use ASCII) and can be
     surprising.  Beware of making _any_ assumptions about the
     collation order
...

In a UNIX environment, the collating order R uses will
normally match the collating order that the system
sort(1) command uses.  This is also the order that is
used by the strcoll(3) library function.  There is an
ISO standard, not for how to compare strings, but for
specifying the rules for how to compare strings.  The
rules can be amazingly elaborate, requiring up to seven
different passes, not all of them in the same direction.

ORIGINALLY the order was lexicographical left to right
by byte values (like the strcmp(3) library function) but
in a world of about 6000 languages and an amazing number
of scripts, that just doesn't match what people actually
want to do.
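
To see the difference between byte order and locale order
for yourself, something along these lines should work.  It
is only a sketch: the vector x and the variable names are
made up for illustration, and what the locale-dependent
line prints depends on your system.

```r
x <- c("b", "A", "a", "B")

# method = "radix" is documented to sort character vectors in C-locale
# order, i.e. by byte value, whatever the session locale is.
byte_order <- sort(x, method = "radix")
print(byte_order)         # "A" "B" "a" "b"

# The default method follows the collating sequence of the current
# locale, so this result may differ from machine to machine.
locale_order <- sort(x)
print(locale_order)

# Temporarily switch collation to "C" so that "<" compares by bytes,
# then put the original setting back.
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
upper_first <- "A" < "a"
print(upper_first)        # TRUE in the C locale
invisible(Sys.setlocale("LC_COLLATE", old))
```

In a locale such as en_US the default sort will typically
put "a" before "A", which is why "A" < "a" came out FALSE
for the original poster.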

> icuGetCollate()
will tell you which ICU locale R is collating by
(when R has been built with ICU and is using it).
> ?icuGetCollate
will not so much tell you more than you wanted to know
about collation as hint at it.
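
A quick way to inspect your own session, as a sketch (the
variable names are mine; the output depends entirely on
your platform and on how R was built):

```r
# Which locale is the session using for collation?
collate_locale <- Sys.getlocale("LC_COLLATE")
print(collate_locale)

# Was this build of R compiled with ICU support?
has_icu <- capabilities("ICU")
print(has_icu)

# If ICU is in use this names the ICU locale; otherwise it says so.
icu_locale <- icuGetCollate()
print(icu_locale)
```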

These days, with Unicode and internationalisation,
text encoding and collation are just insanely complex.
R goes to a lot of trouble to hide this from you.
LET IT.



On Thu, 14 Apr 2022 at 13:38, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:

> https://en.wikipedia.org/wiki/ASCII
> There is a table towards the end of the document. Some of the other pieces
> may be of interest and/or relevant.
>
> Tim
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Kristjan Kure
> Sent: Wednesday, April 13, 2022 10:06 AM
> To: r-help using r-project.org
> Subject: [R] Symbol/String comparison in R
>
> Hi!
>
> Sorry, I am a beginner in R.
>
> I was not able to find answers to my questions (tried Google, Stack
> Overflow, etc). Please correct me if anything is wrong here.
>
> When comparing symbols/strings in R - raw numeric values are compared
> symbol by symbol starting from left? If raw numeric values are not used is
> there an ASCII / Unicode table where symbols have values/ranking/order and
> R compares those values?
>
> *2) Comparing symbols*
> Letter "a" raw value is 61, letter "b" raw value is 62? Is this correct?
>
> # Raw value for "a" = 61
> a_raw <- charToRaw("a")
> a_raw
>
> # Raw value for "b" = 62
> b_raw <- charToRaw("b")
> b_raw
>
> # equals TRUE
> "a" < "b"
>
> Ok, so 61 is less than 62 so it's TRUE. Is this correct?
>
> *3) Comparing strings #1*
> "1040" <= "12000"
>
> raw_1040 <- charToRaw("1040")
> raw_1040
> #31 *30* (comparison happens with the second symbol) 34 30
>
> raw_12000 <- charToRaw("12000")
> raw_12000
> #31 *32* (comparison happens with the second symbol) 30 30 30
>
> The symbol in the second position is 30 and it's less than 32. Equals to
> true. Is this correct?
>
> *4) Comparing strings #2*
> "1040" <= "10000"
>
> raw_1040 <- charToRaw("1040")
> raw_1040
> #31 30 *34*  (comparison happens with third symbol) 30
>
> raw_10000 <- charToRaw("10000")
> raw_10000
> #31 30 *30*  (comparison happens with third symbol) 30 30
>
> The symbol in the third position is 34 is greater than 30. Equals to false.
> Is this correct?
>
> *5) Problem - Why does this equal FALSE?* *"A" < "a"*
>
> 41 < 61 # FALSE?
>
> # Raw value for "A" = 41
> A_raw <- charToRaw("A")
> A_raw
>
> # Raw value for "a" = 61
> a_raw <- charToRaw("a")
> a_raw
>
> Why is capitalized "A" not less than lowercase "a"? Based on raw values it
> should be. What am I missing here?
>
> Thanks
> Kristjan
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
