[R] Is there a package that can do Fuzzy name matching to standardize names in a single column
Spencer Graves
@pencer@gr@ve@ @end|ng |rom e||ect|vede|en@e@org
Wed Jun 15 18:59:54 CEST 2022
The Ecfun package has functions matchName, matchName1, parseName, and
subNonStandardNames that were designed especially to deal with accented
names. They account for the fact that, for example, in "Anastasio Somoza
Debayle", "Somoza" is a surname, not a middle name, and they handle
names that get mangled, like "Andr_ Bruce C_rdenas".
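To illustrate the mangled-name problem only, here is a minimal base R sketch. It is not how Ecfun does it. It assumes each "_" stands for exactly one lost character, and the helper matches_mangled and the reference list known are made up for this example.

```r
# Return TRUE when `mangled` agrees with `candidate` at every position
# except where `mangled` holds the placeholder character.
# Assumption: one placeholder replaces exactly one original character.
matches_mangled <- function(mangled, candidate, placeholder = "_") {
  m <- strsplit(mangled, "")[[1]]    # split into single characters
  k <- strsplit(candidate, "")[[1]]
  length(m) == length(k) && all(m == placeholder | m == k)
}

# Hypothetical reference list of correctly spelled names
known <- c("Andr\u00e9 Bruce C\u00e1rdenas", "Anastasio Somoza Debayle")

mangled <- "Andr_ Bruce C_rdenas"
hits <- known[vapply(known, function(k) matches_mangled(mangled, k), logical(1))]
print(hits)
```

If several reference names fit the same mangled string, all of them are returned, so the result is a candidate list rather than a definitive answer.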
Regarding disambiguating names like "John Good", Wikidata is
wonderful for that. I have not spent much time trying to access
Wikidata from R, but sos::findFn('Wikidata') identified 12 different
packages that mention Wikidata in a help page. One or more of those may
help you do what you want.
Spencer Graves
On 6/15/22 11:24 AM, Jeff Newmiller wrote:
> This is an intractable problem... you cannot know that "John Good" is the same person as "John B. Good"... and even when you augment their identity with information like which town they are in or what the first name of their spouse is, you could be misled by such information in multiple ways.
>
> Most of the time such lists are managed by creating an internal "unique id" for each person. But since there is no definitive way to do this, it is handled as an ongoing process, with multiple algorithms applied with or without human supervision, and always with some risk that a given algorithm will raise the error rate above what was originally present in the data.
>
> One common approach is to use regular expressions or approximate matching algorithms like base::agrep to shortlist likely matching identities for people to check further.
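A minimal sketch of that kind of agrep pre-filtering might look like the following. The sample names and the max.distance = 3 threshold are assumptions for illustration, and the threshold would need tuning on real data.

```r
# Hypothetical sample of inconsistently entered names (not real data)
nms <- c("John Good", "Joe Jackson", "Bob A. Barker", "John B. Good",
         "Joe J. Jackson", "Bob Barker")

# For each name, collect the other entries that approximately contain it.
# max.distance = 3 allows up to three edits (an assumed threshold).
candidates <- lapply(nms, function(p) {
  hits <- agrep(p, nms, max.distance = 3, value = TRUE)
  setdiff(hits, p)   # drop the name itself
})
names(candidates) <- nms

# Keep only the names that have at least one look-alike, for manual review
to_review <- candidates[lengths(candidates) > 0]
print(to_review)
```

Note that agrep looks for the pattern approximately inside each string, so the relation is not symmetric: a short name can match inside a longer one without the reverse being true. The output is a list for a person to check, not a decision.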
>
> Another approach could be a clustering algorithm, as Bert suggested. It has been a while since I had to do this kind of work... I suppose neural nets might also be applied to this problem these days. But agrep pre-filtering with human review should not be discounted... 10k entries is not _that_ many.
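One way the clustering idea could be sketched in base R is to cluster the distinct spellings on edit distance and then replace each spelling with the most frequent one in its cluster. The sample data, the average-linkage method, and the cut height h = 5 are all assumptions here, and none of this settles whether two similar names really are the same person.

```r
# Hypothetical sample of the name column (not real data)
nms <- c("John Good", "Joe Jackson", "Bob A. Barker", "John B. Good",
         "Joe J. Jackson", "Bob A. Barker", "John Good",
         "Joe J. Jackson", "Bob Barker")

u  <- unique(nms)
d  <- as.dist(adist(u))             # pairwise Levenshtein distances
hc <- hclust(d, method = "average") # hierarchical clustering
grp <- cutree(hc, h = 5)            # cut height is a tuning assumption
names(grp) <- u

# Cluster id for every original row
cluster_of <- grp[nms]

# Most frequent spelling within each cluster
canonical <- vapply(split(nms, cluster_of),
                    function(v) names(which.max(table(v))),
                    character(1))

standardized <- unname(canonical[as.character(cluster_of)])
print(data.frame(original = nms, standardized = standardized))
```

When two spellings in a cluster are equally frequent, which.max simply takes the first one in table order, so ties would need an explicit rule. A cut height that is too generous will merge different people, so the clusters still deserve a human look.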
>
> On June 15, 2022 8:43:14 AM PDT, Gregg Powell via R-help <r-help using r-project.org> wrote:
>>
>> Hello Ashim and kind regards for you taking the time to answer back.
>>
>>
>>> library(fuzzyjoin)
>>> ?stringdist_left_join
>>
>> This will join two tables, but what I am trying to do is standardize the similarly spelled duplicate names in the first column of a single table.
>>
>> I don't think fuzzyjoin will help me in that regard.
>>
>> Thanks.
>> Gregg
>> Arizona, USA
>>
>> ------- Original Message -------
>> On Wednesday, June 15th, 2022 at 8:04 AM, Ashim Kapoor <ashimkapoor using gmail.com> wrote:
>>
>>> Dear Gregg,
>>>
>>> Check this out:
>>>
>>> library(fuzzyjoin)
>>> ?stringdist_left_join
>>>
>>> Best Regards,
>>> Ashim
>>>
>>> On Wed, Jun 15, 2022 at 8:28 PM Gregg Powell via R-help
>>> r-help using r-project.org wrote:
>>>
>>>> I have data sets with names in the first column, client names in the second, and client start dates in the third.
>>>>
>>>> There are thousands of these records, with thousands of names, clients, and client start dates. A name is entered each time the person begins with a new client, so each person has many entries in the name column. Often the names were not entered consistently: with or without a middle initial or middle name, or with various abbreviations such as ",RN" at the end of the name.
>>>>
>>>> Is there a package that can do fuzzy name matching so that the names in the name column get replaced with a "standardized" format, where some type of machine learning can pick the most common spelling of each repeated name and replace the different variations with the common spelling?
>>>>
>>>> I included an example below. The first table shows the names with the various spellings. The second table shows what I hope to achieve.
>>>>
>>>> Again, this is on a large scale: there are something like 10,000 records with names that need to be standardized.
>>>>
>>>> Name              Client    Client Start Date
>>>> John Good         Client 1  1/1/2020
>>>> Joe Jackson       Client 2  6/1/2020
>>>> Bob A. Barker     Client 3  8/1/2020
>>>> John B. Good      Client 4  10/1/2020
>>>> Joe J. Jackson    Client 5  12/1/2020
>>>> Bob Allen Barker  Client 6  1/1/2021
>>>> John Good         Client 7  5/1/2021
>>>> Joe Jack Jackson  Client 8  8/1/2021
>>>> Bob Barker        Client 9  12/1/2021
>>>>
>>>> Name              Client    Client Start Date
>>>> John Good         Client 1  1/1/2020
>>>> Joe J. Jackson    Client 2  6/1/2020
>>>> Bob A. Barker     Client 3  8/1/2020
>>>> John Good         Client 4  10/1/2020
>>>> Joe J. Jackson    Client 5  12/1/2020
>>>> Bob A. Barker     Client 6  1/1/2021
>>>> John Good         Client 7  5/1/2021
>>>> Joe J. Jackson    Client 8  8/1/2021
>>>> Bob A. Barker     Client 9  12/1/2021
>>>>
>>>> THANKS!
>>>>
>>>> Gregg Powell
>>>> Arizona, USA
>>>>
>>>> ______________________________________________
>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.