[R] Symbol/String comparison in R

Fri Apr 15 05:05:16 CEST 2022

First off, while it is true that hexadecimal 61 equals
decimal 97 (0x61 == 97 in R), there is nothing about
the interaction
   > charToRaw("a")
   [1] 61
that suggests to the unwary that 61 does not mean
six tens and one unit.  In fact that 61 isn't even
a number, though it looks like one.  It's a "raw".

And that's the problem.  The whole point of "raws"
is to forget how a string was encoded (that is, to
forget what it MEANS), and compare it DIFFERENTLY
from the way it would be compared as a string.  So
it really has no relevance to how string collation
is done.
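
To make the point about "raws" concrete, here is a minimal
sketch using only base R.  The variable names are mine and
purely illustrative.  It shows that the value printed as 61
has class "raw", that it only becomes the number 97 once you
convert it explicitly, and that arithmetic on it is refused.

```r
r <- charToRaw("a")

class(r)          # "raw": neither "numeric" nor "integer"
as.character(r)   # "61": two hexadecimal digits, as text
as.integer(r)     # 97: the byte value as an ordinary number
utf8ToInt("a")    # 97: the Unicode code point, no raws involved

# Arithmetic on a raw is an error, which a real number would not give.
res <- tryCatch(r + 1L, error = function(e) "error")
res               # "error"
```

If all you want is the code point, utf8ToInt() gets you there
directly, without going through bytes at all.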

There is good reason to use "raws", and that is when
you are interfacing with code written in some other
language that for some reason wants byte sequences.

As for the default collation order and other locale-dependent
aspects, R is expected to pick those up from the environment.
If I have
   l=en_NZ.iso88591
   export LC_COLLATE=$l
   export LC_NUMERIC=$l
then when I run R, I expect it to take en_NZ.iso88591
as the locale for collation (it does) and for numbers
(it doesn't, it insists on LC_NUMERIC=C).
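
A small sketch of how one might check this from inside R,
using base R only.  The test vector x is just something I
made up.  Sys.getlocale() reports what R is actually using,
and sort() with method = "radix" is documented to collate as
in the C locale whatever the session locale is, so it gives
a fixed point of comparison for the locale-dependent default.

```r
# What the running session is actually using.
Sys.getlocale("LC_COLLATE")
Sys.getlocale("LC_NUMERIC")

x <- c("b", "A", "a", "B")

# Default sort: follows the collation rules of the session locale,
# so the result can differ from one machine to the next.
locale_order <- sort(x)

# Radix sort: always C-locale collation, upper case before lower case.
c_order <- sort(x, method = "radix")
c_order    # "A" "B" "a" "b"
```

Whether locale_order and c_order agree depends entirely on
the locale R was started in.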

The help for ?"<" is quite clear that collation for
UTF-8 is *not* byte-by-byte but by Unicode code-point.
Now if you are given *valid* UTF-8 in which each
code-point is represented by the smallest number of
bytes possible, byte-by-byte and code-point-by-code-
point should agree, but for quite a bit of "UTF-8" out
there (such as the stuff Java used to generate) they
can be teased apart.
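
Here is a hedged sketch of both halves of that claim in
base R.  The helpers lex_less(), bytes() and points() are
names I invented for the illustration.  The six-byte
sequence for the emoji is written out by hand as a UTF-8
encoded surrogate pair, which is my assumption about what
the Java-style encoding of U+1F600 looks like; it is not
something R produced.

```r
# Lexicographic "less than" on two numeric vectors.
lex_less <- function(x, y) {
  n <- min(length(x), length(y))
  d <- which(x[seq_len(n)] != y[seq_len(n)])
  if (length(d) > 0) x[d[1]] < y[d[1]] else length(x) < length(y)
}

bytes  <- function(s) as.integer(charToRaw(s))  # UTF-8 bytes as numbers
points <- function(s) utf8ToInt(s)              # Unicode code points

# z (U+007A), e-acute (U+00E9), euro sign (U+20AC), an emoji (U+1F600)
s <- c("z", "\u00e9", "\u20ac", "\U0001F600")

# For valid shortest-form UTF-8 the two orders agree on every pair.
agree <- TRUE
for (a in s) for (b in s) {
  if (lex_less(bytes(a), bytes(b)) != lex_less(points(a), points(b)))
    agree <- FALSE
}
agree

# Teasing them apart: U+FF5E versus U+1F600 stored as a surrogate pair
# (D83D DE00), each half encoded as three bytes.
tilde_bytes <- bytes("\uff5e")                      # EF BD 9E
emoji_pair  <- c(0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80)

by_point <- lex_less(points("\uff5e"), points("\U0001F600"))  # TRUE
by_byte  <- lex_less(tilde_bytes, emoji_pair)                 # FALSE
```

By code point U+FF5E sorts before U+1F600, but in the
surrogate-pair form the emoji starts with byte ED, which is
less than EF, so the byte order comes out the other way.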

Alexander Pope's advice is good counsel here.

A little learning is a dangerous thing ;
Drink deep, or taste not the Pierian spring :
There shallow draughts intoxicate the brain,
And drinking largely sobers us again.
...
While from the bounded level of our mind
Short views we take, nor see the lengths behind,
But, more advanced, behold with strange surprise
New distant scenes of endless science rise !
...

If you really want to understand this stuff,
start with "Unicode Demystified", go on to
"The Unicode Standard 14.0.0"
https://www.unicode.org/versions/Unicode14.0.0/
then over to ISO 14651
https://en.wikipedia.org/wiki/ISO/IEC_14651
then to the Unicode Collation Algorithm
and The International Components for Unicode.
Don't forget to check out the Unicode CLDR
(Common Locale Data Repository) version 41
https://home.unicode.org/unicode-cldr-version-41-released/

Or you could let R get on with it and study
some more statistics instead.

On Fri, 15 Apr 2022 at 02:24, Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
wrote:

> While the advice that collation order is not necessarily determined by
> encoding is helpful, the advice suggesting that charToRaw is to be always
> avoided rings false to me, since 61 hexadecimal is the same as 97 decimal.
>
> I am hoping someone will come along and offer useful input like where to
> find the actual collation order implemented by the now-default
> Sys.getlocale("LC_COLLATE")="C.UTF-8", since I was under the impression that
> this particular collation was in fact supposed to collate according to the
> numerical magnitude of the UTF-8 code points but it does not appear to do
> so.
>
> On April 14, 2022 4:25:17 AM PDT, Richard O'Keefe <raoknz using gmail.com>
> wrote:
> >To the original poster: don't even think about
> >charToRaw.  For one thing, the integer code that
> >corresponds to "a" can be found thus:
> >> library(gtools)
> >> asc("a")
> >97
> >and the answer is (predictably) 97, not 61.
> >
> >> ?"<"
> >...
> >     Comparison of strings in character vectors is lexicographic within
> >     the strings using the collating sequence of the locale in use: see
> >     'locales'.  The collating sequence of locales such as 'en_US' is
> >     normally different from 'C' (which should use ASCII) and can be
> >     surprising.  Beware of making _any_ assumptions about the
> >     collation order
> >...
> >
> >In a UNIX environment, the collating order R uses will
> >normally match the collating order that the system
> >sort(1) command uses.  This is also the order that is
> >used by the strcoll(3) library function.  There is an
> >ISO standard, not for how to compare strings, but for
> >specifying the rules for how to compare strings.  The
> >rules can be amazingly elaborate requiring up to seven
> >different passes and not all of them in the same direction.
> >
> >ORIGINALLY the order was lexicographical left to right
> >by byte values (like the strcmp(3) library function) but
> >in a world of about 6000 languages and an amazing number
> >of scripts, that just doesn't match what people actually
> >want to do.
> >
> >> icuGetCollate()
> >will tell you what collation rules R is following.
> >> ?icuGetCollate
> >will not so much tell you more than you wanted to know
> >about collation as hint at it.
> >
> >These days, with Unicode and internationalisation,
> >text encoding and collation are just insanely complex.
> >R goes to a lot of trouble to hide this from you.
> >LET IT.
> >
> >
> >
> >On Thu, 14 Apr 2022 at 13:38, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
> >
> >> https://en.wikipedia.org/wiki/ASCII
> >> There is a table towards the end of the document. Some of the other
> >> pieces may be of interest and/or relevant.
> >>
> >> Tim
> >>
> >> -----Original Message-----
> >> From: R-help <r-help-bounces using r-project.org> On Behalf Of Kristjan Kure
> >> Sent: Wednesday, April 13, 2022 10:06 AM
> >> To: r-help using r-project.org
> >> Subject: [R] Symbol/String comparison in R
> >>
> >> Hi!
> >>
> >> Sorry, I am a beginner in R.
> >>
> >> I was not able to find answers to my questions (tried Google, Stack
> >> Overflow, etc). Please correct me if anything is wrong here.
> >>
> >> When comparing symbols/strings in R - raw numeric values are compared
> >> symbol by symbol starting from left? If raw numeric values are not used
> >> is there an ASCII / Unicode table where symbols have values/ranking/order
> >> and R compares those values?
> >>
> >> *2) Comparing symbols*
> >> Letter "a" raw value is 61, letter "b" raw value is 62? Is this correct?
> >>
> >> # Raw value for "a" = 61
> >> a_raw <- charToRaw("a")
> >> a_raw
> >>
> >> # Raw value for "b" = 62
> >> b_raw <- charToRaw("b")
> >> b_raw
> >>
> >> # equals TRUE
> >> "a" < "b"
> >>
> >> Ok, so 61 is less than 62 so it's TRUE. Is this correct?
> >>
> >> *3) Comparing strings #1*
> >> "1040" <= "12000"
> >>
> >> raw_1040 <- charToRaw("1040")
> >> raw_1040
> >> #31 *30* (comparison happens with the second symbol) 34 30
> >>
> >> raw_12000 <- charToRaw("12000")
> >> raw_12000
> >> #31 *32* (comparison happens with the second symbol) 30 30 30
> >>
> >> The symbol in the second position is 30 and it's less than 32. Equals to
> >> true. Is this correct?
> >>
> >> *4) Comparing strings #2*
> >> "1040" <= "10000"
> >>
> >> raw_1040 <- charToRaw("1040")
> >> raw_1040
> >> #31 30 *34*  (comparison happens with third symbol) 30
> >>
> >> raw_10000 <- charToRaw("10000")
> >> raw_10000
> >> #31 30 *30*  (comparison happens with third symbol) 30 30
> >>
> >> The symbol in the third position is 34 is greater than 30. Equals to
> >> false. Is this correct?
> >>
> >> *5) Problem - Why does this equal FALSE?* *"A" < "a"*
> >>
> >> 41 < 61 # FALSE?
> >>
> >> # Raw value for "A" = 41
> >> A_raw <- charToRaw("A")
> >> A_raw
> >>
> >> # Raw value for "a" = 61
> >> a_raw <- charToRaw("a")
> >> a_raw
> >>
> >> Why is capitalized "A" not less than lowercase "a"? Based on raw values
> >> it should be. What am I missing here?
> >>
> >> Thanks
> >> Kristjan
> >>
> >>         [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
>
> --
> Sent from my phone. Please excuse my brevity.
>
