[R] matching country name tables from different sources

Gabor Grothendieck ggrothendieck at gmail.com
Wed Jan 11 15:41:30 CET 2006


I was aware of that which is why I mentioned that it is usually used
for matching last names rather than countries and noted possible need to
modify the algorithm slightly.  soundex is a relatively simple algorithm
so its not too hard.  For example, one could just code the first
letter too.

On 1/11/06, SAULEAU Erik-André <SAULEAUEA at ch-mulhouse.fr> wrote:
>
>
>  dear all,
>
> yes but the problem with soundex for example is that it does not work when
> an error occur in the first place (Canada vs Kanada) as it keeps the fist
> character. It seems that you have to look after an approximate string
> matching algorithm (for example, a very good one if from Porter-Jaro and
> Winkler at the US Census bureau or have o look to the book of Navarro about
> classification of algorithm).
>
> HTH and an happy new year, erik.
>
>
> -----Message d'origine-----
> De: Gabor Grothendieck
> A: Werner Wernersen
> Cc: r-help at stat.math.ethz.ch
> Date: 10/01/2006 21:16
> Objet: Re: [R] matching country name tables from different sources
>
>
> One other thing to try could be soundex.  ITs normally used for
> last names but it might work here too.  Google to find the
> soundex encoding rules.  Reviewing the country names might
> suggest minor modifications to the soundex algorithm to
> improve it for your case.
>
> On 1/10/06, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > You can improve it somewhat by first accepting all the largest
> > matches and removing the rows and columns for those and
> > repeatedly doing that with what is left.
> >
> > On 1/10/06, Werner Wernersen <pensterfuzzer at yahoo.de> wrote:
> > > Thanks for the nice code, Gabor!
> > >
> > > Unfortunately, it seems not to work for my purpose, confuses lots of
> > > countries when I compare two lists of over 150 countries each.
> > > Do you have any other suggestions?
> > >
> > >
> > >
> > > Gabor Grothendieck <ggrothendieck at gmail.com> schrieb:
> > > If they were the same you could use merge. To figure out
> > > the correspondence automatically or semiautomatically, try this:
> > >
> > > x <- c("Canada", "US", "Mexico")
> > > y <- c("Kanada", "United States", "Mehico")
> > > result <- outer(x, y, function(x,y) mapply(lcs2, x, y))
> > > result[] <- sapply(result, nchar)
> > > # try both which.max and which.min and if you are lucky
> > > # one of them will give unique values and that is the one to use
> > > # In this case which.max does.
> > > apply(result, 1, which.max) # 1 2 3
> > >
> > > # calculate longest common subsequence between 2 strings
> > > lcs2 <- function(s1,s2) {
> > > longest <- function(x,y) if (nchar(x) > nchar(y)) x else y
> > > # Make sure args are strings
> > > a <- as.character(s1); an <- nchar(s1)+1
> > > b <- as.character(s2); bn <- nchar(s2)+1
> > >
> > >
> > > # If one arg is an empty string, returns the length of the other
> > > if (nchar(a)==0) return(nchar(b))
> > > if (nchar(b)==0) return(nchar(a))
> > >
> > >
> > > # Initialize matrix for calculations
> > > m <- matrix("", nrow=an, ncol=bn)
> > >
> > > for (i in 2:an)
> > > for (j in 2:bn)
> > > m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
> > > paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "")
> > > else
> > > longest(m[i-1,j], m[i,j-1])
> > >
> > > # Returns the distance
> > > m[an,bn]
> > > }
> > >
> > >
> > >
> > > On 1/10/06, Werner Wernersen wrote:
> > > > Hi,
> > > >
> > > > Before I reinvent the wheel I wanted to kindly ask you for your
> opinion if
> > > there is a simple way to do it.
> > > >
> > > > I want to merge a larger number of tables from different data
> sources in R
> > > and the matching criterium are country names. The tables are of
> different
> > > size and sometimes the country names do differ slightly.
> > > >
> > > > Has anyone done this or any recommendation on what commands I
> should look
> > > at to automize this task as much as possible?
> > > >
> > > > Thanks a lot for your effort in advance.
> > > >
> > > > All the best,
> > > > Werner
> > > >
> > > >
> > > >
> > > > ---------------------------------
> > > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu
> PC!
> > > >
> > > > [[alternative HTML version deleted]]
> > > >
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
> > > >
> > >
> > >
> > >
> > > ________________________________
> > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu
> PC!
> > > Jetzt Yahoo! Messenger installieren!
> > >
> > >
> >
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
>
>
> ************************************************************************
> **********
> Afin d'eviter toute propagation de virus informatique, et en complement
> des dispositifs en place, ce message (et ses pieces jointes s'il y en a)
>
> a ete automatiquement analyse par un antivirus de messagerie.
> ************************************************************************
> **********
>
> **********************************************************************************
> Afin d'eviter toute propagation de virus informatique, et en complement
> des dispositifs en place, ce message (et ses pieces jointes s'il y en a)
> a ete automatiquement analyse par un antivirus de messagerie.
> **********************************************************************************
>




More information about the R-help mailing list