[R] Matching names with non-English characters
Duncan Murdoch
murdoch.duncan at gmail.com
Mon May 13 18:58:32 CEST 2013
On 13/05/2013 12:05 PM, Spencer Graves wrote:
> Hello:
>
>
> How can one match names containing non-English characters that
> appear differently in different but related data files? For example, I
> have data on Raúl Grijalva, who represents the third district of Arizona
> in the US House of Representatives. This first name appears as "Raúl"
> in data read from one file and "Raul" from another.
>
>
> The ideal would convert both "RaÃºl" and "Raúl" to "Raul".
You shouldn't have both "RaÃºl" and "Raúl" in the same file. They are
different encodings for the same characters. (The first looks like
UTF-8, the second is your native encoding, presumably the Windows
Latin-1 variant, CP-1252.) So your first problem is to identify the
encodings of your input files, and read them all in to a common
encoding. Converting them to UTF-8 in R makes the most sense, because
it includes the characters from all other encodings you're ever likely
to see.
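As a minimal sketch of that conversion (the variable names are made up for illustration, and the Latin-1 bytes are written with an escape so the example does not depend on how the script itself is saved), something along these lines should bring a name from a CP-1252/Latin-1 source and the same name from a UTF-8 source to a common encoding. When reading from disk, the fileEncoding argument of read.csv() can be used to declare the encoding of each file.

```r
# "Raúl" as it would be stored in a Latin-1 / CP-1252 file: the byte 0xFA is ú.
x_latin1 <- "Ra\xfal"
Encoding(x_latin1) <- "latin1"   # declare what the bytes mean

# Re-encode to UTF-8; enc2utf8(x_latin1) should give the same result.
x_utf8 <- iconv(x_latin1, from = "latin1", to = "UTF-8")

# The same name as it would arrive from a UTF-8 file.
y_utf8 <- "Ra\u00fal"

# Once both are in UTF-8 they compare equal.
same_name <- x_utf8 == y_utf8
```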
Having both "Raúl" and "Raul" in the same file is a different issue.
The second one is an error or a variant spelling. In this case, you can
use
iconv("Raúl", to="ASCII//TRANSLIT")
on most platforms to find an ASCII approximation. (This works on my
Windows system; your mileage may vary.) As Jeff said, this is an
impossible problem in general, so you may well need some manual fixups
at the end.
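A rough sketch of how that could be used for matching follows. The helper to_ascii() and the two vectors are hypothetical: the first holds a few names from your list, the second is my guess at how they might appear unaccented in the other file. The result of //TRANSLIT depends on the platform's iconv, so some names may come back as NA or with odd substitutions, which is why the unmatched ones are collected for manual checking.

```r
# Hypothetical helper: build an ASCII key for each name.
to_ascii <- function(x) {
  x <- enc2utf8(x)                                  # common encoding first
  iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")  # platform-dependent result
}

# Names as they might appear in the two files (the second is an assumption).
file1 <- c("Ra\u00fal Grijalva", "Tony C\u00e1rdenas", "Jos\u00e9 Serrano")
file2 <- c("Jose Serrano", "Raul Grijalva", "Tony Cardenas")

key1 <- to_ascii(file1)
key2 <- to_ascii(file2)

# Position in file2 of each name in file1; NA where no match was found.
idx <- match(key1, key2)

# Names that still need a manual fixup.
unmatched <- file1[is.na(idx)]
```

Adding the state and district to the key (pasting them onto the transliterated name) would reduce the risk of false matches between different people with similar names.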
Duncan Murdoch
> A
> reasonable alternative would identify the non-English characters and
> match on everything else ("^Ra" and "l$" in this case). The files all
> contain state and district, so "AZ-3" could be part of the solution.
> However, the file also contains data on Grijalva's predecessor in that
> office, Ben Quayle, so "AZ-3" is not enough.
>
>
> Thanks,
> Spencer
>
>
> p.s. My current data contains other similar cases, e.g.:
>
>
> Recipient          District
> Raúl Grijalva      AZ House 3
> Tony Cárdenas      CA House 29
> Linda Sánchez      CA House 38
> Raúl Labrador      ID House 1
> André Carson       IN House 7
> Bob Menéndez       NJ Senate
> Ben Ray Luján      NM House 3
> José Serrano       NY House 15
> Nydia Velázquez    NY House 7
> Rubén Hinojosa     TX House 15
>
>
> These names all appear differently in another file I have. I've
> written an ugly function that can identify "nonstandard characters".
> I'm confident I can solve this problem. However, I'm adding things like
> this to the Ecdat package, and it would be more useful for others if I
> made better use of other capabilities in R.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.