[R] Tuning string matching
Thomas Lumley
tlumley at u.washington.edu
Wed Jan 5 19:54:39 CET 2005
On Wed, 5 Jan 2005 adi at roda.ro wrote:
> Dear list,
>
> I spent about two hours searching the message archive, to no avail.
> I have a list of people who have to pass an on-line test, but only a fraction
> of them do it. Moreover, as they input their names, the resulting strings do not
> always match the names I have in my database.
>
> I would like to do two things:
>
> 1. Match any strings that are 90% the same
> Example:
> name1 <- "Harry Harrington"
> name2 <- "Harry Harington"
> I need a function that would declare those strings as a match (ideally having an
> argument that would allow introducing 80% instead of 90%)
agrep() does something very similar to this. It has an edit distance
rather than a % similarity, but you should be able to tune it to do what
you want.
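
As a rough sketch of that tuning (the wrapper name similar() and its pct
argument are made up for illustration; they are not part of agrep()):
max.distance can be given as a fraction of the pattern length, so
1 - pct/100 is one plausible way to approximate "pct % the same". Note
that agrep() looks for the pattern as an approximate substring, so the
comparison is not symmetric in its two arguments.

```r
name1 <- "Harry Harrington"
name2 <- "Harry Harington"

# Hypothetical helper: TRUE if `a` occurs in `b` within the allowed
# edit distance. pct = 90 is translated to max.distance = 0.1, i.e.
# edits up to about 10% of the pattern length (rounded up).
similar <- function(a, b, pct = 90) {
  length(agrep(a, b, max.distance = 1 - pct / 100)) > 0
}

similar(name1, name2)        # default: roughly 90% agreement
similar(name1, name2, 80)    # looser: roughly 80% agreement
```
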
> 2. Arrange a final table that would take me from:
>
> Table1 (the complete list of people from my database)
> No Name
> 1 Byron C. Andrew
> 2 Friedman Bob
> 3 Harrington Harry
>
> Table2 (the people having been tested)
> No Name Score
> 1 Harry Harington 13
> 2 Byron Andrew 28
>
> to:
>
> No Name1 Name2 Score
> 1 Byron C. Andrew Byron Andrew 28
> 2 Friedman Bob
> 3 Harrington Harry Harry Harington 13
>
This may not be well defined, since 90% agreement is not an
equivalence relation: A can match B and B can match C without A matching C.
Assuming that sets of matches are either identical or disjoint, you could
construct a numeric variable in table 2 that indicates which row of table
1 to match, by using agrep() in a loop.
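
A minimal sketch of such a loop, using the example tables from the
question. Several things here are assumptions rather than anything
agrep() does for you: the helper normalise() (a made-up name) lower-cases
each name and sorts its words, because the two tables write names in
different word order; the threshold 0.1 is only a starting point; and a
row is assigned only when exactly one candidate is found.

```r
table1 <- data.frame(No = 1:3,
                     Name = c("Byron C. Andrew", "Friedman Bob",
                              "Harrington Harry"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(No = 1:2,
                     Name = c("Harry Harington", "Byron Andrew"),
                     Score = c(13, 28),
                     stringsAsFactors = FALSE)

# Hypothetical helper: lower-case, split on spaces, sort the words,
# so that "Harrington Harry" and "Harry Harington" become comparable.
normalise <- function(x) {
  vapply(strsplit(tolower(x), " +"),
         function(w) paste(sort(w), collapse = " "),
         character(1))
}

key1 <- normalise(table1$Name)
key2 <- normalise(table2$Name)

# For each tested person, record which row of table1 matches (NA if
# there is no match or the match is ambiguous).
table2$row1 <- NA_integer_
for (i in seq_along(key2)) {
  hits <- agrep(key2[i], key1, max.distance = 0.1)
  if (length(hits) == 1) table2$row1[i] <- hits
}

# Build the combined table, keeping every row of table1.
idx <- match(seq_len(nrow(table1)), table2$row1)
result <- data.frame(No    = table1$No,
                     Name1 = table1$Name,
                     Name2 = table2$Name[idx],
                     Score = table2$Score[idx],
                     stringsAsFactors = FALSE)
result
```

Rows of table 1 with no match get NA in Name2 and Score. Ambiguous
matches are left unassigned here so they can be checked by hand.
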
-thomas