[R] How to use compare.linkage in RecordLinkage package -- unexpected output

Anders Alexandersson andersalex at gmail.com
Thu Jan 28 16:18:44 CET 2016


I am using the compare.linkage function in the RecordLinkage package,
and getting a result I know is wrong, so I know I'm misunderstanding
something.
I am using R 3.2.3 for x64 Windows. I am very familiar with Stata but not so
much with R.

I can create record pairs from the blocking fields, but all pairs have
unknown status (is_match is NA).
I cannot create matches or non-matches. I want a simple working example of
how to link datasets using the RecordLinkage package. It seems that the
manual and the R Journal Vol. 2/2 only show how to de-duplicate a single
dataset using the compare.dedup function, not how to link two datasets
together using the compare.linkage function. I can reproduce the examples
in the R Journal article, so my R installation is fine.

The example datasets in the manual have 500 and 10000 observations on 7
variables, but 1 observation and 2 variables will be enough to show the
problem.
My first comparison pattern looks like this:
  id1  id2 fname_c1 bm is_match
1  17  343        1  1       NA

Instead, I want and expect a comparison pattern that looks like this:
  id1  id2 fname_c1 bm is_match
1  17  343        1  1       1

My blocking variable is fname_c1, the first component of the first name. My
matching variable is bm, the birth month. My understanding is that row 1 in
my example output is the first pair for which fname_c1 matched in the
underlying datasets. I want and expect is_match to be 1 when the comparison
value for the matching variable is 1 (bm agrees in both linkage datasets),
as in the example.
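
To make the comparison pattern concrete, here is a base-R sketch of how I
read it. It is not the package's implementation; the helper agreement() is
hypothetical, and exact comparison is assumed. The two records use the
fname_c1 and bm values of RLdata500[17, ] and RLdata10000[343, ] as printed
further down. Both have bm = 9, so the 1 in the bm column appears to mean
"the values agree", not "the birth month is 1".

```r
# Values copied from RLdata500[17, ] and RLdata10000[343, ] as printed in this post
rec1 <- data.frame(fname_c1 = "ALEXANDER", bm = 9L, stringsAsFactors = FALSE)
rec2 <- data.frame(fname_c1 = "ALEXANDER", bm = 9L, stringsAsFactors = FALSE)

# Hypothetical helper: 1 where the two records agree on a field, 0 otherwise
agreement <- function(a, b) {
  vapply(names(a), function(f) as.integer(a[[f]] == b[[f]]), integer(1))
}

pattern <- agreement(rec1, rec2)
print(pattern)  # fname_c1 = 1, bm = 1
```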

For more details, this is what I typed and the R output:
> library(RecordLinkage)
> data(RLdata500)
> data(RLdata10000)
> RLdata500[17, ]
    fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
17 ALEXANDER     <NA>  MUELLER     <NA> 1974  9  9
> RLdata10000[343, ]
     fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
343 ALEXANDER     <NA>  BAUMANN     <NA> 1957  9  7
> rpairs <- compare.linkage(RLdata500, RLdata10000, blockfld = c(1),
+                           exclude = c(2:5, 7))
> rpairs$pairs[c(1:2), ] # Why is_match=NA? (should be 1)
  id1  id2 fname_c1 bm is_match
1  17  343        1  1       NA
2  17 2385        1  0       NA
> rpairs <- epiWeights(rpairs) # (Weight calculation)
> summary(rpairs) # (0 matches in Linkage Dataset)

Linkage Data Set

500 records in data set 1
10000 records in data set 2
47890 record pairs

0 matches
0 non-matches
47890 pairs with unknown status


Weight distribution:
[omitted here to save space]

References:
1. Manual for Package ‘RecordLinkage’
(Available online at
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf)
2. R Journal article "The RecordLinkage Package: Detecting Errors
in Data"
(Available online in PDF at
https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf)

I saw something in the manual and the R Journal article about identity
arguments for true match results, but I guess I only need those for reference
("gold standard") datasets. In my example bm is non-missing in both underlying
records (the comparison value is bm=1), so a missing value is not why the
result is NA. What am I missing? How does one link two simple datasets using
compare.linkage?
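
As far as the manual describes it, compare.linkage accepts identity1 and
identity2 vectors holding the true identifier of each record, and is_match
is derived from those vectors. Without them the status stays NA. Below is a
minimal sketch under that assumption. The toy data frames d1 and d2 and the
ID vectors true_id1 and true_id2 are made up for illustration and are not
part of the package.

```r
library(RecordLinkage)

# Hypothetical toy data: two small files with the same columns
d1 <- data.frame(fname = c("ANNA", "PETER"),
                 bm    = c(1, 5),
                 stringsAsFactors = FALSE)
d2 <- data.frame(fname = c("ANNA", "PETER", "ANNA"),
                 bm    = c(1, 5, 7),
                 stringsAsFactors = FALSE)

# Hypothetical "gold standard": records with equal IDs are the same person
true_id1 <- c(1, 2)
true_id2 <- c(1, 2, 3)

# Block on the first column (fname) and pass the known identities
rp <- compare.linkage(d1, d2, blockfld = 1,
                      identity1 = true_id1, identity2 = true_id2)

# is_match should now be 0 or 1 instead of NA
print(rp$pairs)
```

If no true identifiers are available, is_match presumably stays NA by
design, and the pairs would have to be classified from their weights
instead (for example with epiWeights followed by epiClassify).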
