# [R] matching country name tables from different sources

McGehee, Robert Robert.McGehee at geodecapital.com
Tue Jan 10 20:28:27 CET 2006

```
I would throw a tolower() around s1 and s2 so that 'canada' matches with
'CANADA', and perhaps consider using a Levenshtein distance rather than
the longest common subsequence.

An algorithm for Levenshtein distance can be found here (courtesy of
Stephen Upton)
https://stat.ethz.ch/pipermail/r-help/2005-January/062254.html

Robert

-----Original Message-----
From: Werner Wernersen [mailto:pensterfuzzer at yahoo.de]
Sent: Tuesday, January 10, 2006 2:00 PM
To: Gabor Grothendieck
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] matching country name tables from different sources

Thanks for the nice code, Gabor!

Unfortunately, it does not seem to work for my purpose: it confuses lots
of countries when I compare two lists of over 150 countries each.
Do you have any other suggestions?

Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
If they were the same you could use merge. To figure out the
correspondence automatically or semiautomatically, try this:

x <- c("Canada", "US", "Mexico")  # assumed example: x is not defined in the quoted message
y <- c("Kanada", "United States", "Mehico")
result <- outer(x, y, function(x,y) mapply(lcs2, x, y))
result[] <- sapply(result, nchar)
# try both which.max and which.min and if you are lucky
# one of them will give unique values and that is the one to use
# In this case which.max does.
apply(result, 1, which.max)  # 1 2 3

# calculate longest common subsequence between 2 strings
lcs2 <- function(s1, s2) {
  longest <- function(x, y) if (nchar(x) > nchar(y)) x else y
  # Make sure args are strings
  a <- as.character(s1); an <- nchar(a) + 1
  b <- as.character(s2); bn <- nchar(b) + 1

  # If either arg is an empty string, the common subsequence is empty
  # (this also keeps the 2:an and 2:bn loops from counting backwards)
  if (nchar(a) == 0 || nchar(b) == 0) return("")

  # m[i,j] holds the LCS of the first i-1 chars of a and the first j-1 chars of b
  m <- matrix("", nrow = an, ncol = bn)

  for (i in 2:an)
    for (j in 2:bn)
      m[i,j] <- if (substr(a,i-1,i-1) == substr(b,j-1,j-1))
        paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "")
      else
        longest(m[i-1,j], m[i,j-1])

  # Returns the longest common subsequence itself (a string), not a distance
  m[an,bn]
}

On 1/10/06, Werner Wernersen wrote:
> Hi,
>
>  Before I reinvent the wheel I wanted to kindly ask for your
> opinion on whether there is a simple way to do it.
>
>  I want to merge a larger number of tables from different data sources
> in R, and the matching criterion is the country name. The tables are of
> different sizes and sometimes the country names differ slightly.
>
>  Has anyone done this, or does anyone have a recommendation on what
> commands I should look at to automate this task as much as possible?
>
>
>  All the best,
>    Werner
>
>
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> http://www.R-project.org/posting-guide.html
>

```
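
The tolower() and Levenshtein suggestions at the top of this message could be sketched roughly as below. This is only an illustration under some assumptions: it relies on adist() from the utils package, which ships with recent versions of R and is not the function from the linked post; the country spellings and the columns a and b are made-up example data; and taking the nearest name with which.min is just one possible matching rule. With 150+ countries it is probably wise to inspect the pairs with a large distance by hand.

```r
# Hypothetical example data (not from the thread): two spellings of the same countries
x <- c("Canada", "United States", "Mexico")
y <- c("MEHICO", "Kanada", "United states")

# Lower-case both sides so that differences in case do not count as edits
xl <- tolower(x)
yl <- tolower(y)

# Matrix of Levenshtein distances: rows index x, columns index y
d <- adist(xl, yl)

# For each name in x, pick the closest name in y
best <- apply(d, 1, which.min)
matches <- data.frame(
  x = x,
  y = y[best],
  distance = d[cbind(seq_along(x), best)],
  stringsAsFactors = FALSE
)
print(matches)

# Hypothetical tables keyed by the two spellings
t1 <- data.frame(country = x, a = 1:3, stringsAsFactors = FALSE)
t2 <- data.frame(country = y, b = c(10, 20, 30), stringsAsFactors = FALSE)

# Rewrite the spellings in t2 to those used in t1, then merge as usual
t2$country <- matches$x[match(t2$country, matches$y)]
merged <- merge(t1, t2, by = "country")
print(merged)
```

The distance column is kept so that doubtful matches are easy to spot before the merge is trusted.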