[R] Data linkage functions for probabilistic linkage using person identifiers

Doran, Harold HDoran at air.org
Wed Nov 18 23:26:26 CET 2009


Interestingly enough, I just posted a package to CRAN with a function that might be useful. The package is called MiscPsycho and is intended for psychometric work; the updated version should be available in a day or so. It has a function called stringMatch, which implements the Levenshtein distance or a normalized version of that distance (what I call the LND). There is also a function called stringProbs, which gives the probability of observing a given LND.

In education, we merge data sets all the time using a unique ID. It turns out, however, that the unique ID is not so unique. It is often shared by many kids over time, duplicated within a year, etc. So, we need to first merge using the ID and then validate that we have merged properly using some other mechanism. I think the LND is very useful for this purpose.
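The merge-then-validate idea can be sketched in base R roughly as follows. This is only an illustration: the two data frames, their column names, and the helper lnd_sketch are hypothetical, and the helper assumes the LND is 1 minus the Levenshtein distance divided by the length of the longer string. It uses adist from the utils package in current versions of R rather than the MiscPsycho functions.

```r
# Hypothetical data: two files that are supposed to share a unique ID.
roster <- data.frame(id = c(101, 102, 103),
                     name = c("William Clinton", "Ann Lee", "Jose Ruiz"),
                     stringsAsFactors = FALSE)
scores <- data.frame(id = c(101, 102, 103),
                     name = c("Bill Clinton", "Ann Lee", "Mary Smith"),
                     stringsAsFactors = FALSE)

# Step 1: merge on the ID alone.
merged <- merge(roster, scores, by = "id",
                suffixes = c("_roster", "_scores"))

# Step 2: validate each merged row by comparing the names.
# Assumed definition: LND = 1 - distance / length of the longer string.
lnd_sketch <- function(a, b) {
  d <- mapply(function(x, y) drop(utils::adist(x, y)), a, b,
              USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

merged$lnd <- lnd_sketch(merged$name_roster, merged$name_scores)
merged
```

Rows whose names agree exactly get an LND of 1, while a row such as ID 103, where the ID matches but the names clearly belong to different people, gets a low value and can be flagged for review.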

So, here is an example of the function in this package:

### A perfect match gives an LND of 1
> stringMatch('William Clinton', 'William Clinton', normalize='YES')
[1] 1

### A close match gives an LND less than 1
> stringMatch('William Clinton', 'Bill Clinton', normalize='YES')
[1] 0.7333333
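
For readers without the package at hand, the two values above can be reproduced with a small base R sketch. It assumes the LND is 1 minus the Levenshtein distance divided by the length of the longer string, which is consistent with the output shown (a distance of 4 over 15 characters gives 0.7333333). The name lnd_sketch is hypothetical, and this is not the MiscPsycho source code.

```r
# Sketch of a normalized Levenshtein distance (LND); not MiscPsycho's code.
# Assumption: LND = 1 - distance / (length of the longer string).
lnd_sketch <- function(a, b) {
  d <- drop(utils::adist(a, b))        # raw Levenshtein edit distance
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)          # two empty strings match trivially
  1 - d / longest
}

lnd_sketch("William Clinton", "William Clinton")  # 1
lnd_sketch("William Clinton", "Bill Clinton")     # 0.7333333
```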

If your database is small, you can actually look at the records and see whether pairs with values less than 1 are really the same name spelled differently, misspelled, etc.

But if your data set has hundreds of thousands of records, that becomes impossible. So, what I do is compute the probability that you would observe an LND of .7 or higher. This is implemented in the stringProbs function. Let's say the probability of observing an LND of .7 or higher is .05, and the probability is even larger for lower LND cutoffs. Assuming you are willing to live with this much risk, you might then subset your data and retain records as "valid merges" only if the LND value is .7 or higher.
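
A rough sketch of this thresholding step in base R follows. It makes a loud assumption: the probability is estimated empirically as the share of pairs of different names whose LND is at or above the cutoff. stringProbs may well compute it differently. The toy names, the merged data frame, and the helper lnd_sketch are all hypothetical.

```r
# Assumed definition: LND = 1 - distance / length of the longer string.
lnd_sketch <- function(a, b) {
  d <- mapply(function(x, y) drop(utils::adist(x, y)), a, b,
              USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Hypothetical pool of names belonging to different people.
pool <- c("William Clinton", "Ann Lee", "Jose Ruiz",
          "Mary Smith", "George Bush")

# Every pairing of two different names stands in for a "chance" match.
pairs <- t(combn(pool, 2))
chance_lnd <- lnd_sketch(pairs[, 1], pairs[, 2])

cutoff <- 0.7
# Assumed estimator: share of chance pairs with an LND at or above the cutoff.
p_chance <- mean(chance_lnd >= cutoff)

# Hypothetical merged file: same ID, names taken from the two sources.
merged <- data.frame(id = c(101, 102, 103),
                     name_a = c("William Clinton", "Ann Lee", "Jose Ruiz"),
                     name_b = c("Bill Clinton", "Ann Lee", "Mary Smith"),
                     stringsAsFactors = FALSE)
merged$lnd <- lnd_sketch(merged$name_a, merged$name_b)

# Retain as "valid merges" only the rows at or above the cutoff.
valid <- merged[merged$lnd >= cutoff, ]
valid
```

With real data the pool would be the full set of names, and the cutoff would be chosen by looking at how p_chance changes as the cutoff moves, rather than fixed at .7 in advance.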

The record linkage literature is very large in general, but it is extremely small in education. So, I have a paper in press demonstrating this application and comparing it to other linking methods, such as the use of Soundex codes. In the paper, I also discuss how you would combine other demographic information, such as birthdates, to further explore the probability of a correct match.

Harold



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
Sent: Wednesday, November 18, 2009 4:32 PM
To: Dagan A WRIGHT
Cc: r-help at r-project.org
Subject: Re: [R] Data linkage functions for probabilistic linkage using person identifiers


On Nov 18, 2009, at 1:21 PM, Dagan A WRIGHT wrote:

> I am somewhat new to R although using and liking already.  I am  
> curious if there are any probabilistic packages similar in function  
> to others such as Link King (http://www.the-link-king.com/).  I am  
> looking for functions in SSN, First/Last name, date of birth, and a  
> couple other indicators for matching.
>

Cannot comment on similarities to Link King but have used the  
functions found with this search in similar applications:

RSiteSearch("Levenshtein")  #yes, that is spelled correctly


> Thanks
>
> Dagan Wright, Ph.D., M.S.P.H.
> Lead Addictions Research Analyst, Analysis & Evaluation Unit
> Addictions & Mental Health Division (AMH)
> 500 Summer St. NE E86
> Salem, Oregon 97301-1118
>
> Office number: 503-945-5726
> Fax number:     503-378-8467
> dagan.a.wright at state.or.us
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
