[R] Problem comparing two strings

peter dalgaard pdalgd using gmail.com
Mon Nov 18 16:48:04 CET 2019


A version of this came up not long ago in a slightly different context (bug 17369: parse() doesn't honor Unicode in NFD normalization).

The basic issue is that there are different Unicode normalization forms, such as NFC and NFD (look it up...).

Briefly, accented characters exist in two forms, one as a single code point and another as the base letter followed by the accent. 

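I.e., there is the single letter "ä" (U+00E4) and then "a\u308", which is "a" followed by "combining diaeresis" (U+0308), which effectively puts a ¨ on top of the preceding character. In UTF-8 these are the bytes c3 a4 and 61 cc 88, which is exactly the difference between your two strings.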
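
To make the two forms concrete, here is a small base-R sketch. The variable names and the shortened example string are made up for illustration; only `utf8ToInt()` and `charToRaw()` from base R are used.

```r
# Two spellings of the same visible name (illustrative values).
composed   <- "J\u00e4ger"    # "ä" as the single code point U+00E4
decomposed <- "Ja\u0308ger"   # "a" followed by U+0308 (combining diaeresis)

# The strings look alike when printed but do not compare as equal.
same <- composed == decomposed
print(same)

# Code points: one value for the precomposed letter, two for the decomposed one.
cp_composed   <- utf8ToInt("\u00e4")
cp_decomposed <- utf8ToInt("a\u0308")
print(cp_composed)
print(cp_decomposed)

# The UTF-8 bytes differ in the same way as in the question.
bytes_composed   <- charToRaw(composed)
bytes_decomposed <- charToRaw(decomposed)
print(bytes_composed)
print(bytes_decomposed)
```

The comparison is done on the encoded strings, so the different code point sequences are enough to make `==` return FALSE.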

The utf8 package has code for normalizing strings (see utf8_normalize()).
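
A minimal sketch of that approach, assuming the utf8 package is installed and that its `utf8_normalize()` converts to the composed form (NFC) by default, as its documentation describes. The vector `x` is a made-up stand-in for the two data sources in the question.

```r
# Hypothetical stand-ins for the two sources in the question.
x <- c("Alexander J\u00e4ger",    # precomposed, as in the csv data
       "Alexander Ja\u0308ger")   # decomposed, as in the macOS file names

if (requireNamespace("utf8", quietly = TRUE)) {
  # Bring both strings to the same normalization form before comparing.
  x_norm <- utf8::utf8_normalize(x)

  print(x[1] == x[2])            # comparison before normalizing
  print(x_norm[1] == x_norm[2])  # comparison after normalizing
  print(charToRaw(x_norm[2]))    # bytes of the formerly decomposed string
} else {
  message("Package 'utf8' is not installed; this sketch needs it.")
}
```

For the join, the same idea would be to normalize the name column in both data sets first and only then merge on it.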

-pd

> On 18 Nov 2019, at 16:11, Björn Fisseler <bjoern.fisseler using googlemail.com> wrote:
> 
> Hello,
> 
> I'm struggling to compare two strings that come from different data 
> sets. The strings look identical: "Alexander Jäger"
> 
> But when I compare these strings: string1 == string2
> the result is FALSE.
> 
> Looking at the raw bytes used to encode the strings, the results are 
> different:
> 
> string1: 41 6c 65 78 61 6e 64 65 72 20 4a c3 a4 67 65 72
> string2: 41 6c 65 78 61 6e 64 65 72 20 4a 61 cc 88 67 65 72
> 
> string2 comes from the file names of different files on my machine 
> (macOS), string1 comes from a data file (csv, UTF8 encoding).
> 
> It's obviously the umlaut "ä" in this example, which is encoded with two 
> bytes in one string and three in the other. The question is how to change 
> this? This problem makes it impossible to join the two data sets based on the 
> names. I already checked the settings on my machine: Sys.getlocale() 
> returns "de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8". 
> Changing/forcing the encoding of the data didn't bring the results I 
> expected.
> 
> What else can I try?
> 
> Best regards
> 
>         Björn

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com


