[R] Matching names with non-English characters

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Mon May 13 18:18:09 CEST 2013


Build a lookup table for your data.
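
A minimal sketch of what such a lookup table could look like in base R. The table contents, the object name name_lookup and the helper canonical_name() are hypothetical, made up for illustration; the accented names are written with \u escapes so the code does not depend on the encoding of the script file.

```r
# Hypothetical lookup table: one row per spelling variant seen in any file,
# mapped to a canonical key chosen by hand.
name_lookup <- data.frame(
  variant   = c("Ra\u00fal Grijalva", "Raul Grijalva",
                "Tony C\u00e1rdenas", "Tony Cardenas"),
  canonical = c("Raul Grijalva", "Raul Grijalva",
                "Tony Cardenas", "Tony Cardenas"),
  stringsAsFactors = FALSE
)

# Return the canonical name for each input, or NA when the spelling is not
# in the table yet, so new variants are flagged rather than guessed.
canonical_name <- function(x, lookup = name_lookup) {
  lookup$canonical[match(x, lookup$variant)]
}

file_a <- c("Ra\u00fal Grijalva", "Tony C\u00e1rdenas")
file_b <- c("Raul Grijalva", "Tony Cardenas", "Ben Quayle")

canonical_name(file_a)
canonical_name(file_b)  # the last entry is NA: "Ben Quayle" is not in the table
```

Names that come back as NA are the ones that still need a row added by hand, which keeps the mapping explicit and reviewable.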

I think it is a fool's errand to think that you can automatically "normalize" arbitrary Unicode characters to an ASCII form that everyone will agree on.

BTW: To avoid propagating open joins, your data should probably have some kind of id for the term those Representatives are serving.
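
A rough illustration of the term-id point, assuming "open join" here means a join whose key does not identify a single row. The column name term (a Congress number) and the amount values are invented for the example.

```r
# Illustrative data only; 'term' is a hypothetical column (a Congress number).
members <- data.frame(
  district = c("AZ-3", "AZ-3"),
  term     = c(112L, 113L),
  member   = c("Ben Quayle", "Raul Grijalva"),
  stringsAsFactors = FALSE
)
records <- data.frame(
  district = c("AZ-3", "AZ-3"),
  term     = c(113L, 113L),
  amount   = c(100, 250),  # made-up values
  stringsAsFactors = FALSE
)

# Joining on district alone pairs every record with both office holders.
by_district <- merge(records, members[c("district", "member")], by = "district")
nrow(by_district)  # 4 rows from 2 records

# Joining on district and term keeps exactly one member per record.
by_term <- merge(records, members, by = c("district", "term"))
nrow(by_term)  # 2 rows
```

With the term in the key, the predecessor in the same district no longer matches, so the name only has to be resolved within one district and term.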
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

Spencer Graves <spencer.graves at structuremonitoring.com> wrote:

>Hello:
>
>
>       How can one match names containing non-English characters that
>appear differently in different but related data files?  For example, I
>have data on Raúl Grijalva, who represents the third district of Arizona
>in the US House of Representatives.  This first name appears as "Raúl"
>in data read from one file and "Raul" from another.
>
>
>       The ideal would convert both "Raúl" and "Raul" to "Raul".  A 
>reasonable alternative would identify the non-English characters and 
>match on everything else ("^Ra" and "l$" in this case).  The files all 
>contain state and district, so "AZ-3" could be part of the solution. 
>However, the file also contains data on Grijalva's predecessor in that 
>office, Ben Quayle, so "AZ-3" is not enough.
>
>
>       Thanks,
>       Spencer
>
>
>p.s.  My current data contains other similar cases, e.g.:
>
>
>Recipient       District
>Raúl Grijalva   AZ House 3
>Tony Cárdenas   CA House 29
>Linda Sánchez   CA House 38
>Raúl Labrador   ID House 1
>André Carson    IN House 7
>Bob Menéndez    NJ Senate
>Ben Ray Luján   NM House 3
>José Serrano    NY House 15
>Nydia Velázquez NY House 7
>Rubén Hinojosa  TX House 15
>
>
>       These names all appear differently in another file I have. I've
>written an ugly function that can identify "nonstandard characters".
>I'm confident I can solve this problem.  However, I'm adding things like
>this to the Ecdat package, and it would be more useful for others if I
>made better use of other capabilities in R.
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.


