[R] Tuning string matching

Thomas Lumley tlumley at u.washington.edu
Wed Jan 5 19:54:39 CET 2005


On Wed, 5 Jan 2005 adi at roda.ro wrote:

> Dear list,
>
> I spent about two hours searching the message archive, to no avail.
> I have a list of people that have to pass an on-line test, but only a fraction
> of them do it. Moreover, as they input their names, the resulting strings do not
> always match the names I have in my database.
>
> I would like to do two things:
>
> 1. Match any strings that are 90% the same
> Example:
> name1 <- "Harry Harrington"
> name2 <- "Harry Harington"
> I need a function that would declare those strings as a match (ideally having an
> argument that would allow introducing 80% instead of 90%)

agrep() does something very similar to this.  It takes a maximum edit 
distance rather than a % similarity (its max.distance argument can be 
given as a fraction of the pattern length), so you should be able to 
tune it to do what you want.
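
A minimal sketch of that tuning, using the two example names.  With 
max.distance = 0.1 the allowed edit distance is 10% of the pattern length, 
which plays roughly the role of "90% the same".  Note that agrep() looks 
for the pattern inside each string, so a short pattern can match a much 
longer string.  The helper similar() is hypothetical (it is not part of R) 
and relies on adist() from the utils package, which is not mentioned above, 
to get an explicit percentage.

```r
name1 <- "Harry Harrington"
name2 <- "Harry Harington"

# agrepl() is the logical-valued companion of agrep():
# is name2 found in name1 within 10% of the pattern length in edits?
hit <- agrepl(name2, name1, max.distance = 0.1)
hit                                        # TRUE

# Hypothetical helper: similarity defined as
# 1 - (edit distance) / (length of the longer string).
similar <- function(a, b, similarity = 0.9) {
  d <- adist(a, b)[1, 1]                   # generalized Levenshtein distance
  1 - d / max(nchar(a), nchar(b)) >= similarity
}

similar(name1, name2)                      # TRUE  (1 - 1/16 = 0.9375)
similar(name1, name2, similarity = 0.95)   # FALSE
```

Other definitions of "% the same" are possible (for example dividing by 
the shorter string), so the threshold would need checking against real data.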

> 2. Arrange a final table that would take me from:
>
> Table1 (the complete list of people from my database)
> No Name
> 1  Byron C. Andrew
> 2  Friedman Bob
> 3  Harrington Harry
>
> Table2 (the people having been tested)
> No Name               Score
> 1  Harry Harington    13
> 2  Byron Andrew       28
>
> to:
>
> No Name1              Name2              Score
> 1  Byron C. Andrew    Byron Andrew       28
> 2  Friedman Bob
> 3  Harrington Harry   Harry Harington    13
>

This may not be very well-defined, since 90% agreement is not an 
equivalence relation.
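
To see why, here is a small illustration.  The third spelling, 
"Harry Harinton", is made up for the purpose, and the similarity measure 
(1 minus edit distance over the longer length, computed with adist()) is 
an assumption, not something defined in the question.  The first and second 
names agree at 90%, as do the second and third, but the first and third 
do not, so "matches" is not transitive.

```r
x <- c("Harry Harrington", "Harry Harington", "Harry Harinton")

d   <- adist(x)                           # pairwise edit distances
len <- outer(nchar(x), nchar(x), pmax)    # length of the longer string in each pair
sim <- 1 - d / len                        # assumed similarity measure

m <- sim >= 0.9
m[1, 2]   # TRUE:  one edit in 16 characters
m[2, 3]   # TRUE:  one edit in 15 characters
m[1, 3]   # FALSE: two edits in 16 characters
```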

Assuming that sets of matches are either identical or disjoint, you could 
construct a numeric variable in table 2 that indicates which row of table 
1 to match, by using agrep() in a loop.
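
A rough sketch of such a loop on the example tables.  Because the example 
names differ in word order and in a middle initial, plain agrep() on the 
raw strings would not find them, so the sketch first normalizes each name 
with name_key(), a hypothetical helper that is my own assumption and not 
part of the suggestion above.  The column name row1 and the 0.1 threshold 
are likewise only illustrative.

```r
table1 <- data.frame(No = 1:3,
                     Name = c("Byron C. Andrew", "Friedman Bob", "Harrington Harry"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(No = 1:2,
                     Name = c("Harry Harington", "Byron Andrew"),
                     Score = c(13, 28),
                     stringsAsFactors = FALSE)

# Hypothetical helper: lower-case, drop punctuation and one-letter
# initials, and sort the words so that word order does not matter.
name_key <- function(x) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "[[:space:]]+")
  vapply(words,
         function(w) paste(sort(w[nchar(w) > 1], method = "radix"), collapse = " "),
         character(1))
}

key1 <- name_key(table1$Name)
key2 <- name_key(table2$Name)

# For each row of table 2, record which row of table 1 it matches.
table2$row1 <- NA_integer_
for (i in seq_along(key2)) {
  hits <- agrep(key2[i], key1, max.distance = 0.1)
  if (length(hits) == 1) table2$row1[i] <- hits   # keep unambiguous matches only
}

# Fill the combined table; unmatched rows of table 1 keep NA.
result <- data.frame(No = table1$No, Name1 = table1$Name,
                     Name2 = NA_character_, Score = NA_real_,
                     stringsAsFactors = FALSE)
ok <- !is.na(table2$row1)
result$Name2[table2$row1[ok]] <- table2$Name[ok]
result$Score[table2$row1[ok]] <- table2$Score[ok]
result
```

Rows of table 2 with no match, or with more than one candidate, are left 
as NA in row1 and would need to be resolved by hand.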


 	-thomas



