[R] Tuning string matching
Adrian DUSA
adi at roda.ro
Thu Jan 6 14:41:47 CET 2005
Oh this is excellent, Stephen. Now I can create a code to break the initial
strings:
> name1 <- "Harry Harrington"
> name2 <- "Harington Harry"
> str1 <- unlist(strsplit(name1, " "))
> str2 <- unlist(strsplit(name2, " "))
> str1
[1] "Harry" "Harrington"
> str2
[1] "Harington" "Harry"
and compare the words using your function. Brilliant.
Cheers,
Adrian
Quoting Stephen Upton <supton at referentia.com>:
> Adrian,
>
> As an exercise, I took the pseudocode on the wiki pages for the Levenshtein
> distance and programmed it in R. The code is below. I tested it for just 2
> strings, so I'm not claiming that it *really* works, but it seems to. As you
> can see, I didn't add any error checking, and there is likely some cool R
> shortcuts that could be added.
>
> As to your problem, I'd also suggest that you might want to apply the below
> function to possible combinations of words rather than attempting to apply
> the function to a complete name; that should alleviate the "first.name
> last.name", "last.name first.name" problem.
>
> > levenshtein.distance("Harrington","Harington")
> [1] 1
>
> > levenshtein.distance("Harrington Harry","Harry Harington")
> [1] 11
>
> HTH
> Steve
>
>
> levenshtein.distance <- function(string.1, string.2, subst.cost=1) {
> c1 <- strsplit(string.1,split="")[[1]]
> c2 <- strsplit(string.2,split="")[[1]]
> n <- length(c1)
> m <- length(c2)
>
> d <- array(0,dim=c(n+1,m+1))
>
> d[,1] <- 1:(n+1)
> d[1,] <- 1:(m+1)
> d[1,1] <- 0
>
>
> for (i in 2:(n+1)) {
>
> for (j in 2:(m+1)) {
>
> if (c1[i-1] == c2[j-1]) cost <- 0 else cost <- subst.cost
>
> d[i,j] <- min(d[i-1,j] + 1, # insertion
> d[i,j-1] + 1, # deletion
> d[i-1,j-1] + cost) # substitution
> }
> }
>
> d[n+1,m+1]
>
> }
>
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-
> > bounces at stat.math.ethz.ch] On Behalf Of adi at roda.ro
> > Sent: Thursday, January 06, 2005 4:32 AM
> > To: Jonathan Baron
> > Cc: McGehee, Robert; bogdan romocea; r-help at stat.math.ethz.ch
> > Subject: Re: [R] Tuning string matching
> >
> >
> > Thank you all for your replies. Indeed I used:
> > http://finzi.psych.upenn.edu/nmz.html
> > as a search site. I used "string match", instead of "fuzzy string match".
> >
> > Fuzzy matching seems to me a rather complicated matter, whereas my initial
> > idea
> > about solving this problem was a bit simpler:
> > - check all characters in both strings (a 2 dimensional matrix of
> > characters)
> > - if 90% (or any other percent) of the characters in both strings are
> > similar
> > (in terms of distances between each character from the first string to all
> > characters from the second string), then the two strings will be declared
> > as a
> > match
> >
> > I just found out that this algorithm is called the "Levenshtein distance",
> > and I
> > know there is a PHP function called "levenshtein" (I thought it already
> > might
> > have been implemented in R).
> >
> > For anyone that have a clue on how to read this stuff:
> > http://ro.php.net/levenshtein
> >
> > I tried to use agrep:
> > > agrep("Harry Harrington", "Harry Harrington")
> > [1] 1
> > > agrep("Harry Harrington", "Harrington Harry")
> > numeric(0)
> >
> > So it seems not to be what I'm looking for (I'll try harder with edit
> > distance,
> > though)
> >
> > Best regards,
> > Adrian
> >
> >
> > Quoting Jonathan Baron <baron at psych.upenn.edu>:
> >
> > > Sorry for joining late, but I wanted to see if my search page
> > > could help. (I don't know which search archive you looked at.)
> > > I entered
> > > fuzzy string match*
> > > and got a few things that look relevant, including the agrep
> > > function.
> > >
> > > As for the second part of the question, that seems to be a coding
> > > problem that is dependent on the current form of your data.
> > > Write me off the list and I'll send you an R script I use for
> > > similar things (making PayPal payments).
> > >
> > > Jon
> > >
> > > Dear list,
> > >
> > > I spent about two hours searching on the message archive, with no
> > > avail.
> > > I have a list of people that have to pass an on-line test, but only a
> > > fraction
> > > of them do it. Moreover, as they input their names, the resulting
> > > string do not
> > > always match the names I have in my database.
> > >
> > > I would like to do two things:
> > >
> > > 1. Match any strings that are 90% the same
> > > Example:
> > > name1 <- "Harry Harrington"
> > > name2 <- "Harry Harington"
> > > I need a function that would declare those strings as a match (ideally
> > > having an
> > > argument that would allow introducing 80% instead of 90%)
> > >
> > > --
> > > Jonathan Baron, Professor of Psychology, University of Pennsylvania
> > > Home page: http://www.sas.upenn.edu/~baron
> > > R search page: http://finzi.psych.upenn.edu/
> > >
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide! http://www.R-project.org/posting-
> > guide.html
>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Adrian Dusa (adi at roda.ro)
Romanian Social Data Archive (www.roda.ro)
1, Schitu Magureanu Bd.
010181 Bucharest sector 5
Romania
Tel./Fax: +40 (21) 312.66.1
More information about the R-help
mailing list