[R] matching country name tables from different sources
bogdan romocea
br44114 at gmail.com
Tue Jan 10 20:24:49 CET 2006
See
http://en.wikipedia.org/wiki/Levenshtein_distance
http://thread.gmane.org/gmane.comp.lang.r.general/31499
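
As a rough illustration of the Levenshtein idea, here is a minimal sketch using adist() from the utils package in current R versions. The vectors x and y are the toy names from the quoted message below. The 0.5 cutoff is an arbitrary assumption for flagging doubtful matches, not a recommended value.

```r
# Toy data taken from the quoted example; real tables would supply these.
x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")

# Generalized Levenshtein (edit) distance between every pair of names.
# ignore.case = TRUE so that differences in capitalization cost nothing.
d <- adist(x, y, ignore.case = TRUE)
dimnames(d) <- list(x, y)

# Scale by the length of the longer name so that short and long
# names are comparable (0 means identical).
len <- outer(nchar(x), nchar(y), pmax)
d_rel <- d / len

# Closest candidate in y for each element of x, plus its relative distance.
best <- apply(d_rel, 1, which.min)
matches <- data.frame(x = x, y = y[best],
                      rel_dist = d_rel[cbind(seq_along(x), best)])

# Assumed cutoff: flag anything above it for manual review.
matches$review <- matches$rel_dist > 0.5
matches
```

Abbreviations such as "US" are far from their long form in edit distance, so they will probably still need a small hand-made table of exceptions.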
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Werner
> Wernersen
> Sent: Tuesday, January 10, 2006 2:00 PM
> To: Gabor Grothendieck
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] matching country name tables from different sources
>
> Thanks for the nice code, Gabor!
>
> Unfortunately, it does not seem to work for my purpose: it
> confuses lots of countries when I compare two lists of over
> 150 countries each.
> Do you have any other suggestions?
>
>
>
> Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
>
> If they were the same you could use merge. To figure out
> the correspondence automatically or semiautomatically, try this:
>
> # lcs2() is defined further down; define it before running these lines
> x <- c("Canada", "US", "Mexico")
> y <- c("Kanada", "United States", "Mehico")
> result <- outer(x, y, function(x, y) mapply(lcs2, x, y))
> result[] <- sapply(result, nchar)
> # try both which.max and which.min and if you are lucky
> # one of them will give unique values and that is the one to use
> # In this case which.max does.
> apply(result, 1, which.max) # 1 2 3
>
> # calculate longest common subsequence between 2 strings
> lcs2 <- function(s1, s2) {
>   longest <- function(x, y) if (nchar(x) > nchar(y)) x else y
>   # Make sure args are strings
>   a <- as.character(s1); an <- nchar(a) + 1
>   b <- as.character(s2); bn <- nchar(b) + 1
>
>   # If either arg is an empty string, the common subsequence is empty
>   if (nchar(a) == 0 || nchar(b) == 0) return("")
>
>   # Initialize matrix for calculations; m[i, j] holds the LCS of the
>   # first i-1 characters of a and the first j-1 characters of b
>   m <- matrix("", nrow = an, ncol = bn)
>
>   for (i in 2:an) {
>     for (j in 2:bn) {
>       m[i, j] <- if (substr(a, i-1, i-1) == substr(b, j-1, j-1)) {
>         paste(m[i-1, j-1], substr(a, i-1, i-1), sep = "")
>       } else {
>         longest(m[i-1, j], m[i, j-1])
>       }
>     }
>   }
>
>   # Returns the longest common subsequence itself (a string), not a distance
>   m[an, bn]
> }
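
Once a correspondence such as 1 2 3 has been found and checked by hand, it can be kept as a lookup table and fed to merge(). A minimal sketch follows; the data frames t1 and t2, their columns and values, and the key table are all made up for illustration.

```r
# Hypothetical tables from two sources with differently spelled country names.
t1 <- data.frame(country = c("Canada", "US", "Mexico"),
                 gdp = c(1, 2, 3), stringsAsFactors = FALSE)
t2 <- data.frame(country = c("Mehico", "Kanada", "United States"),
                 pop = c(30, 10, 20), stringsAsFactors = FALSE)

# Lookup table: one row per country, pairing the spelling used in each source.
# In practice this would come from the matching step and then be hand-checked.
key <- data.frame(name1 = c("Canada", "US", "Mexico"),
                  name2 = c("Kanada", "United States", "Mehico"),
                  stringsAsFactors = FALSE)

# A one-to-one mapping should have no repeated names on either side.
stopifnot(!anyDuplicated(key$name1), !anyDuplicated(key$name2))

# Translate t2's names into t1's spelling, then merge on the common key.
t2$country <- key$name1[match(t2$country, key$name2)]
merged <- merge(t1, t2, by = "country", all = TRUE)
merged
```

With all = TRUE, countries that appear in only one table are kept with NA in the other table's columns, which makes unmatched names easy to spot.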
>
>
>
> On 1/10/06, Werner Wernersen wrote:
> > Hi,
> >
> > Before I reinvent the wheel, I wanted to kindly ask for your
> > opinion on whether there is a simple way to do it.
> >
> > I want to merge a larger number of tables from different data
> > sources in R, and the matching criterion is the country name.
> > The tables are of different sizes and sometimes the country
> > names differ slightly.
> >
> > Has anyone done this, or does anyone have a recommendation on which
> > commands I should look at to automate this task as much as possible?
> >
> > Thanks a lot for your effort in advance.
> >
> > All the best,
> > Werner
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html