[R] matching country name tables from different sources

Roger Bivand Roger.Bivand at nhh.no
Wed Jan 11 11:49:41 CET 2006


On Tue, 10 Jan 2006, McGehee, Robert wrote:

> I would throw a tolower() around s1 and s2 so that 'canada' matches with
> 'CANADA', and perhaps consider using a Levenshtein distance rather than
> the longest common subsequence.
> 
> An algorithm for Levenshtein distance can be found here (courtesy of
> Stephen Upton)
> https://stat.ethz.ch/pipermail/r-help/2005-January/062254.html

Or even ?agrep: it uses Levenshtein edit distance and has an argument
(ignore.case) for ignoring case. It is the first hit for
RSiteSearch("fuzzy match"), by the way.
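
A minimal sketch of the agrep() idea, assuming two made-up name vectors
(x and y are illustrative, not the poster's data). It relies only on the
documented pattern, x and ignore.case arguments and leaves max.distance at
its default:

```r
# Two hypothetical country-name vectors from different sources
x <- c("canada", "Mexico", "Norway")
y <- c("CANADA", "Mehico", "Sweden", "Norway")

# For each name in x, collect the positions in y that agrep() accepts
# as approximate matches, ignoring case
matches <- lapply(x, function(nm) agrep(nm, y, ignore.case = TRUE))
names(matches) <- x
matches
```

Note that agrep() looks for an approximate match of the pattern within each
element, so very short patterns (abbreviations such as "US") can match inside
longer, unrelated names and are probably better handled by hand.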

> 
> Robert
> 
> -----Original Message-----
> From: Werner Wernersen [mailto:pensterfuzzer at yahoo.de] 
> Sent: Tuesday, January 10, 2006 2:00 PM
> To: Gabor Grothendieck
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] matching country name tables from different sources
> 
> Thanks for the nice code, Gabor! 
>   
> Unfortunately, it does not seem to work for my purpose: it confuses many
> countries when I compare two lists of over 150 countries each.
> Do you have any other suggestions?
> 
> 
> Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> 
> If the names were identical you could use merge. To figure out the
> correspondence automatically or semiautomatically, try this:
> 
> x <- c("Canada", "US", "Mexico")
> y <- c("Kanada", "United States", "Mehico")
> result <- outer(x, y, function(x,y) mapply(lcs2, x, y))
> result[] <- sapply(result, nchar)
> # try both which.max and which.min and if you are lucky
> # one of them will give unique values and that is the one to use
> # In this case which.max does.
> apply(result, 1, which.max)  # 1 2 3
> 
> # Compute the longest common subsequence (LCS) of two strings.
> # Define this function before running the outer() call above.
> lcs2 <- function(s1, s2) {
>      longest <- function(x, y) if (nchar(x) > nchar(y)) x else y
>      # Make sure args are strings
>      a <- as.character(s1); an <- nchar(a) + 1
>      b <- as.character(s2); bn <- nchar(b) + 1
> 
>      # If either arg is an empty string, the LCS is empty
>      if (nchar(a) == 0 || nchar(b) == 0) return("")
> 
>      # m[i,j] holds the LCS of the first i-1 chars of a
>      # and the first j-1 chars of b
>      m <- matrix("", nrow = an, ncol = bn)
> 
>      for (i in 2:an)
>           for (j in 2:bn)
>                m[i,j] <- if (substr(a,i-1,i-1) == substr(b,j-1,j-1))
>                     paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "")
>                else
>                     longest(m[i-1,j], m[i,j-1])
> 
>      # Return the LCS itself (a string), not a distance
>      m[an,bn]
> }
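
The suggestion made earlier in the thread (lower-case both sides and use
Levenshtein distance instead of the LCS) could be sketched as follows. This
assumes a current R version, where adist() in the utils package computes the
distance matrix directly; the cutoff of 2 edits is an arbitrary choice for
illustration, not something from the thread:

```r
x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")

# Levenshtein distances between lower-cased names: rows follow x, columns y
d <- adist(tolower(x), tolower(y))
dimnames(d) <- list(x, y)

# For each name in x, pick the closest name in y and record its distance
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(x), best)]

res <- data.frame(x = x, candidate = y[best], distance = best_dist,
                  stringsAsFactors = FALSE)

# Keep only close matches; anything further away is left for manual review
res$candidate[res$distance > 2] <- NA
res
```

Spelling variants such as Kanada and Mehico come out at distance 1, but an
abbreviation like "US" is 11 edits away from "United States" and is no closer
to it than to the other names, so it ends up as NA here. Cases like that
still need a hand-made mapping.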
> 
> 
> 
> On 1/10/06, Werner Wernersen wrote:
> > Hi,
> >
> >  Before I reinvent the wheel, I wanted to ask for your opinion on
> >  whether there is a simple way to do this.
> >
> >  I want to merge a large number of tables from different data sources
> >  in R, and the matching criterion is the country name. The tables are of
> >  different sizes, and sometimes the country names differ slightly.
> >
> >  Has anyone done this, or does anyone have a recommendation on which
> >  commands I should look at to automate this task as much as possible?
> >
> >  Thanks a lot for your effort in advance.
> >
> >  All the best,
> >    Werner
> >
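
For the merge step itself, one possible sketch: once a correspondence between
spellings has been checked by hand, recode one table's key and call merge().
The tables gdp and pop, their values, and the lookup vector are all made-up
placeholders for illustration:

```r
# Two hypothetical tables keyed by differently spelled country names
gdp <- data.frame(country = c("Canada", "US", "Mexico"),
                  gdp = c(1, 2, 3), stringsAsFactors = FALSE)
pop <- data.frame(country = c("Kanada", "United States", "Mehico"),
                  pop = c(10, 20, 30), stringsAsFactors = FALSE)

# Hand-checked lookup from the second table's spellings to the first's
lookup <- c("Kanada" = "Canada", "United States" = "US", "Mehico" = "Mexico")

# Recode the key; names without a lookup entry are kept unchanged
recoded <- unname(lookup[pop$country])
pop$country <- ifelse(is.na(recoded), pop$country, recoded)

# all = TRUE keeps unmatched rows so leftover mismatches stay visible
merged <- merge(gdp, pop, by = "country", all = TRUE)
merged
```

Rows that still contain NA after the merge point to names missing from the
lookup.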
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
> >

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no



