[R] Tuning string matching

Thu Jan 6 14:41:47 CET 2005

Oh this is excellent, Stephen. Now I can create a code to break the initial
strings:
> name1 <- "Harry Harrington"
> name2 <- "Harington Harry"
> str1 <- unlist(strsplit(name1, " "))
> str2 <- unlist(strsplit(name2, " "))
> str1
[1] "Harry"      "Harrington"
> str2
[1] "Harington" "Harry"

and compare the words using your function. Brilliant.
Cheers,
Adrian

Quoting Stephen Upton <supton at referentia.com>:

> Adrian,
>
> As an exercise, I took the pseudocode on the wiki pages for the Levenshtein
> distance and programmed it in R. The code is below. I tested it for just 2
> strings, so I'm not claiming that it *really* works, but it seems to. As you
> can see, I didn't add any error checking, and there is likely some cool R
> shortcuts that could be added.
>
> As to your problem, I'd also suggest that you might want to apply the below
> function to possible combinations of words rather than attempting to apply
> the function to a complete name; that should alleviate the "first.name
> last.name", "last.name first.name" problem.
>
> > levenshtein.distance("Harrington","Harington")
> [1] 1
>
> > levenshtein.distance("Harrington Harry","Harry Harington")
> [1] 11
>
> HTH
> Steve
>
>
> levenshtein.distance <- function(string.1, string.2, subst.cost=1) {
>     c1 <- strsplit(string.1,split="")[[1]]
>     c2 <- strsplit(string.2,split="")[[1]]
>     n <- length(c1)
>     m <- length(c2)
>
>     d <- array(0,dim=c(n+1,m+1))
>
>     d[,1] <- 1:(n+1)
>     d[1,] <- 1:(m+1)
>     d[1,1] <- 0
>
>
>     for (i in 2:(n+1)) {
>
>         for (j in 2:(m+1)) {
>
>             if (c1[i-1] == c2[j-1]) cost <- 0 else cost <- subst.cost
>
>             d[i,j] <- min(d[i-1,j] + 1,    # insertion
>                           d[i,j-1] + 1,    # deletion
>                           d[i-1,j-1] + cost) # substitution
> 	}
>     }
>
>     d[n+1,m+1]
>
> }
>
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-
> > bounces at stat.math.ethz.ch] On Behalf Of adi at roda.ro
> > Sent: Thursday, January 06, 2005 4:32 AM
> > To: Jonathan Baron
> > Cc: McGehee, Robert; bogdan romocea; r-help at stat.math.ethz.ch
> > Subject: Re: [R] Tuning string matching
> >
> >
> > Thank you all for your replies. Indeed I used:
> > http://finzi.psych.upenn.edu/nmz.html
> > as a search site. I used "string match", instead of "fuzzy string match".
> >
> > Fuzzy matching seems to me a rather complicated matter, whereas my initial
> > idea
> > about solving this problem was a bit simpler:
> > - check all characters in both strings (a 2 dimensional matrix of
> > characters)
> > - if 90% (or any other percent) of the characters in both strings are
> > similar
> > (in terms of distances between each character from the first string to all
> > characters from the second string), then the two strings will be declared
> > as a
> > match
> >
> > I just found out that this algorithm is called the "Levenshtein distance",
> > and I
> > know there is a PHP function called "levenshtein" (I thought it already
> > might
> > have been implemented in R).
> >
> > For anyone that have a clue on how to read this stuff:
> > http://ro.php.net/levenshtein
> >
> > I tried to use agrep:
> > > agrep("Harry Harrington", "Harry Harrington")
> > [1] 1
> > > agrep("Harry Harrington", "Harrington Harry")
> > numeric(0)
> >
> > So it seems not to be what I'm looking for (I'll try harder with edit
> > distance,
> > though)
> >
> > Best regards,
> > Adrian
> >
> >
> > Quoting Jonathan Baron <baron at psych.upenn.edu>:
> >
> > > Sorry for joining late, but I wanted to see if my search page
> > > could help.  (I don't know which search archive you looked at.)
> > > I entered
> > > fuzzy string match*
> > > and got a few things that look relevant, including the agrep
> > > function.
> > >
> > > As for the second part of the question, that seems to be a coding
> > > problem that is dependent on the current form of your data.
> > > Write me off the list and I'll send you an R script I use for
> > > similar things (making PayPal payments).
> > >
> > > Jon
> > >
> > >  Dear list,
> > >
> > >  I spent about two hours searching on the message archive, with no
> > >  avail.
> > >  I have a list of people that have to pass an on-line test, but only a
> > >  fraction
> > >  of them do it. Moreover, as they input their names, the resulting
> > >  string do not
> > >  always match the names I have in my database.
> > >
> > >  I would like to do two things:
> > >
> > >  1. Match any strings that are 90% the same
> > >  Example:
> > >  name1 <- "Harry Harrington"
> > >  name2 <- "Harry Harington"
> > >  I need a function that would declare those strings as a match (ideally
> > >  having an
> > >  argument that would allow introducing 80% instead of 90%)
> > >
> > > --
> > > Jonathan Baron, Professor of Psychology, University of Pennsylvania
> > > Home page: http://www.sas.upenn.edu/~baron
> > > R search page: http://finzi.psych.upenn.edu/
> > >
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide! http://www.R-project.org/posting-
> > guide.html
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Adrian Dusa (adi at roda.ro)
Romanian Social Data Archive (www.roda.ro)
1, Schitu Magureanu Bd.
010181 Bucharest sector 5
Romania
Tel./Fax: +40 (21) 312.66.1