[R] Problem comparing two strings

Ivan Krylov
Mon Nov 18 16:34:34 CET 2019


On Mon, 18 Nov 2019 16:11:44 +0100
"Björn Fisseler" <bjoern.fisseler using googlemail.com> wrote:

> It's obviously the umlaut "ä" in this example which is encoded with
> two respectively three bytes. The question is how to change this?

Welcome to the wonderful world of Unicode-related problems! It is,
indeed, possible to represent the same glyph using either one code
point (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, two bytes in UTF-8)
or two code points (U+0061 LATIN SMALL LETTER A followed by U+0308
COMBINING DIAERESIS, three bytes in UTF-8). (Other combinations of
code points resulting in the same glyph are probably also possible.)

What you are looking for is called "Unicode normalization". It is
implemented in the stringi package: stri_trans_nfc normalizes strings
(there are multiple normal forms to choose from, but W3C guidelines
recommend NFC), and stri_compare / stri_cmp test for canonical
equivalence.

See also: ?stringi::stri_cmp and https://stackoverflow.com/a/20684794
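
A minimal sketch of the above, assuming stringi is installed. The two
strings are made-up examples (not your actual data), and the variable
names are only for illustration:

```r
# Assumes the stringi package is available: install.packages("stringi")
library(stringi)

# Two spellings of the same glyph (hypothetical example strings)
precomposed <- "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS
decomposed  <- "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# The code points differ, so a plain comparison reports a mismatch
plain_equal <- precomposed == decomposed

# Number of bytes in each string
byte_counts <- c(stri_numbytes(precomposed), stri_numbytes(decomposed))

# Option 1: normalize both sides to NFC, then compare as usual
nfc_equal <- stri_trans_nfc(precomposed) == stri_trans_nfc(decomposed)

# Option 2: let stringi compare the strings directly;
# stri_cmp returns 0 when it considers the two strings equal
cmp_result <- stri_cmp(precomposed, decomposed)

# stri_cmp_equiv tests canonical equivalence and returns a logical
equiv <- stri_cmp_equiv(precomposed, decomposed)
```

If the strings are going to be compared or used as keys more than
once, it is probably simpler to run stri_trans_nfc over the whole
column once, right after reading the data, and use == afterwards.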

-- 
Best regards,
Ivan


