[R] Matching names with non-English characters
Jeff Newmiller
jdnewmil at dcn.davis.CA.us
Mon May 13 18:18:09 CEST 2013
Build a lookup table for your data.
I think it is a fool's errand to expect that you can automatically "normalize" arbitrary Unicode characters to an ASCII form that everyone will agree on.
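A minimal sketch of such a lookup table in R follows. The object and column names (`lookup`, `file_a`, `variant`, `canonical`) and the sample rows are made up for illustration, and the canonical spellings are simply one possible choice. Non-ASCII characters are written as `\u` escapes so the code does not depend on the encoding of the script file.

```r
# Hypothetical lookup table: one row per spelling variant seen in the data,
# mapped by hand to a canonical spelling.
lookup <- data.frame(
  variant   = c("Ra\u00fal Grijalva", "Raul Grijalva",
                "Jos\u00e9 Serrano",  "Jose Serrano"),
  canonical = c("Raul Grijalva", "Raul Grijalva",
                "Jose Serrano",  "Jose Serrano"),
  stringsAsFactors = FALSE
)

# Names as they might appear in one of the files (made-up sample).
file_a <- data.frame(
  Recipient = c("Ra\u00fal Grijalva", "Jose Serrano", "Ben Quayle"),
  stringsAsFactors = FALSE
)

# match() gives NA for names that are not in the table yet, so they
# can be reviewed by hand rather than guessed at.
idx <- match(file_a$Recipient, lookup$variant)
file_a$canonical <- lookup$canonical[idx]
unmatched <- file_a$Recipient[is.na(idx)]
```

Anything that ends up in `unmatched` is a candidate for a new row in the table; the table itself can live in a CSV file next to the data.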
BTW: To avoid propagating open joins, your data should probably have some kind of id for the term those Representatives are serving.
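To illustrate the point about a term id, here is a small sketch using `merge()`. The column `congress` is a hypothetical term identifier, and all rows and the `amount` value are made up for the example; the real files may identify terms differently.

```r
# Made-up sample rows; 'congress' is a hypothetical term identifier.
reps <- data.frame(
  state    = c("AZ", "AZ"),
  district = c(3L, 3L),
  congress = c(112L, 113L),
  name     = c("Ben Quayle", "Raul Grijalva"),
  stringsAsFactors = FALSE
)
contrib <- data.frame(
  state    = "AZ",
  district = 3L,
  congress = 113L,
  amount   = 1000,  # made-up value
  stringsAsFactors = FALSE
)

# Joining on state and district alone matches both office holders ...
loose <- merge(contrib[c("state", "district", "amount")], reps,
               by = c("state", "district"))

# ... while including the term id leaves exactly one match.
strict <- merge(contrib, reps, by = c("state", "district", "congress"))
```

Here `loose` has two rows (one per office holder) and `strict` has one, which is why "AZ-3" by itself is not a sufficient key.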
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
Spencer Graves <spencer.graves at structuremonitoring.com> wrote:
>Hello:
>
>
> How can one match names containing non-English characters that
>appear differently in different but related data files? For example, I
>have data on Raúl Grijalva, who represents the third district of Arizona
>in the US House of Representatives. This first name appears as "Raúl"
>in data read from one file and "Raul" from another.
>
>
> The ideal would convert both "Raúl" and "Raúl" to "Raul". A
>reasonable alternative would identify the non-English characters and
>match on everything else ("^Ra" and "l$" in this case). The files all
>contain state and district, so "AZ-3" could be part of the solution.
>However, the file also contains data on Grijalva's predecessor in that
>office, Ben Quayle, so "AZ-3" is not enough.
>
>
> Thanks,
> Spencer
>
>
>p.s. My current data contains other similar cases, e.g.:
>
>
>Recipient        District
>Raúl Grijalva    AZ House 3
>Tony Cárdenas    CA House 29
>Linda Sánchez    CA House 38
>Raúl Labrador    ID House 1
>André Carson     IN House 7
>Bob Menéndez     NJ Senate
>Ben Ray Luján    NM House 3
>José Serrano     NY House 15
>Nydia Velázquez  NY House 7
>Rubén Hinojosa   TX House 15
>
>
> These names all appear differently in another file I have. I've
>written an ugly function that can identify "nonstandard characters".
>I'm confident I can solve this problem. However, I'm adding things like
>this to the Ecdat package, and it would be more useful for others if I
>made better use of other capabilities in R.
>