[R] String search: Return "closest" match

Richard.Cotton at hsl.gov.uk
Tue Aug 26 15:10:11 CEST 2008


> I have to match names where names can be recorded with errors or
> additions.
> Now I am searching for a string search function which returns always
> the "closest" match. E.g. searching for  "Washington" it should 
> return only Washington but not Washington, D.C. But it also could be
> that the list contains only "Hamburg" but the record I am searching 
> for is "Hamburg OTM" and then we still want to find "Hamburg". Or 
> maybe the list contains "Hamburg" and "Hamberg" but we are searching
> for "Hamburg" and thus only this one should be returned.
> 
> agrep() returns all "close" matches but unfortunately does not 
> return the degree of closeness; otherwise selection would be easy.
> Is there such a function already implemented?

The Levenshtein distance is a common metric for determining how close two 
strings are (in fact, agrep uses this).  There's a function to calculate it 
on the R wiki.
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:levenshtein
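
For illustration, here is a minimal sketch of the standard dynamic
programming approach to the Levenshtein distance.  It is not the code from
the wiki page, and the name levenshtein_sketch is made up for this example.

```r
# Hypothetical helper (not the wiki code): Levenshtein distance between
# two single strings, counting insertions, deletions and substitutions.
levenshtein_sketch <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds the distance between the first i characters
  # of a and the first j characters of b.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # delete a character of a
                             d[i + 1, j] + 1L,  # insert a character of b
                             d[i, j] + cost)    # substitute (or keep)
    }
  }
  d[n + 1, m + 1]
}

levenshtein_sketch("Hamburg", "Hamberg")  # 1: one substitution
```

The double loop runs in time proportional to the product of the two string
lengths, which is why a pure R version gets slow on long lists.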

You can use this to find the closest string.  (If your set of cities is 
large, it may be quickest to use agrep to narrow the selection first, 
since the pure R implementation of levenshtein is likely to be slow.)
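
A rough sketch of that two-step idea follows.  It assumes a version of R
whose utils package provides adist() (a built-in generalized Levenshtein
distance, which is an assumption on my part and not the wiki function).
The helper name closest_match and the max.distance default of 0.3 are
made up for this example.

```r
# Hypothetical helper: return the candidate(s) nearest to `query`.
closest_match <- function(query, candidates, max.distance = 0.3) {
  # Narrow the candidates with agrep first; fall back to the whole list
  # if nothing is within max.distance.
  hits <- agrep(query, candidates, max.distance = max.distance,
                value = TRUE)
  if (length(hits) == 0L) hits <- candidates
  # adist() gives a matrix of edit distances, one row per query and one
  # column per candidate; drop() turns the single row into a vector.
  d <- drop(utils::adist(query, hits))
  # Keep every candidate tied for the smallest distance.
  hits[d == min(d)]
}

cities <- c("Hamburg", "Hamberg", "Washington", "Washington, D.C.")
closest_match("Hamburg", cities)      # "Hamburg"
closest_match("Washington", cities)   # "Washington"
closest_match("Hamburg OTM", cities)  # "Hamburg"
```

Ties are all returned, so you may still need a rule for picking among
equally close names.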

Regards,
Richie.

Mathematical Sciences Unit
HSL

