[R] How to use compare.linkage in RecordLinkage package -- unexpected output
Anders Alexandersson
andersalex at gmail.com
Thu Jan 28 16:18:44 CET 2016
I am using the compare.linkage function in the RecordLinkage package,
and getting a result I know is wrong, so I know I'm misunderstanding
something.
I am using R 3.2.3 for x64 Windows. I am very familiar with Stata but not so
much with R.
I can create record pairs from the blocking fields, but all pairs have
unknown status (is_match is NA).
I cannot create matches or non-matches. I want a simple working example of
how to link datasets using the RecordLinkage package. It seems that the
manual and the R Journal Vol. 2/2 only show how to de-duplicate a single
dataset using the compare.dedup function, not how to link two datasets
together using the compare.linkage function. I can reproduce the examples
in the R Journal article, so my R installation is fine.
The example datasets in the manual have 500 and 10000 observations on 7
variables, but 1 record pair and 2 variables will be enough to show the
problem.
My first comparison pattern looks like this:
  id1 id2 fname_c1 bm is_match
1  17 343        1  1       NA
Instead, I want and expect a comparison pattern that looks like this:
  id1 id2 fname_c1 bm is_match
1  17 343        1  1        1
My blocking variable is fname_c1 (the first component of the first name). My
matching variable is bm (birth month). My understanding is that row 1 in
my example output is the first pair for which fname_c1 matched in the
underlying datasets. I want and expect is_match to be 1 when the matching
variable bm agrees between the two linked records (bm=1 in the comparison
pattern), as in the example.
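One way to check what the fname_c1 and bm columns of a comparison pattern encode is to compare the two underlying records by hand. The sketch below assumes that, for fields compared without a string comparator, the pattern stores 1 for agreement and 0 for disagreement rather than the field value itself. The names rec1, rec2, fields, and agree are introduced here for illustration only.

```r
library(RecordLinkage)
data(RLdata500)
data(RLdata10000)

# The two records behind the first comparison pattern (id1 = 17, id2 = 343)
rec1 <- RLdata500[17, ]
rec2 <- RLdata10000[343, ]

# Field-by-field agreement for the two fields kept in the pattern:
# 1 if the values are equal, 0 if they differ, NA if either is missing.
# as.character() guards against the columns being stored as factors.
fields <- c("fname_c1", "bm")
agree <- sapply(fields, function(f) {
  as.integer(as.character(rec1[[f]]) == as.character(rec2[[f]]))
})
agree
```

If the assumption holds, agree should reproduce the fname_c1 and bm entries of row 1 of the pattern, which would make them agreement indicators rather than the values stored in the data.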
For more details, this is what I typed and the R output:
> library(RecordLinkage)
> data(RLdata500)
> data(RLdata10000)
> RLdata500[17, ]
    fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
17 ALEXANDER     <NA>  MUELLER     <NA> 1974  9  9
> RLdata10000[343, ]
     fname_c1 fname_c2 lname_c1 lname_c2   by bm bd
343 ALEXANDER     <NA>  BAUMANN     <NA> 1957  9  7
> rpairs <- compare.linkage(RLdata500, RLdata10000, blockfld=c(1), exclude=c(2:5,7))
> rpairs$pairs[c(1:2), ] # Why is_match=NA? (should be 1)
  id1  id2 fname_c1 bm is_match
1  17  343        1  1       NA
2  17 2385        1  0       NA
> rpairs <- epiWeights(rpairs) # (Weight calculation)
> summary(rpairs) # (0 matches in Linkage Dataset)
Linkage Data Set
500 records in data set 1
10000 records in data set 2
47890 record pairs
0 matches
0 non-matches
47890 pairs with unknown status
Weight distribution:
[omitted here to save space]
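The counts of matches and non-matches in this summary appear to describe the known status stored in is_match, not a classification result. A minimal sketch of turning the weights into predicted links follows, assuming that epiClassify is the intended next step after epiWeights. The threshold of 0.7 is an arbitrary value chosen for illustration and is not a recommendation; result is a name introduced here.

```r
library(RecordLinkage)
data(RLdata500)
data(RLdata10000)

# Same record pairs as above: block on fname_c1, compare only bm
rpairs <- compare.linkage(RLdata500, RLdata10000,
                          blockfld = 1, exclude = c(2:5, 7))
rpairs <- epiWeights(rpairs)

# Classify every pair by its weight; 0.7 is an arbitrary illustrative cutoff.
# With a single threshold, each pair is predicted as link or non-link.
result <- epiClassify(rpairs, threshold.upper = 0.7)

# Predicted status per pair, independent of the (unknown) is_match column
table(result$prediction)
```

With only one compared field besides the blocking field, the weights can take very few distinct values, so a realistic linkage would compare more fields before classifying.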
References:
1. Manual for Package ‘RecordLinkage’
(Available online at
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf)
2. R Journal article "The RecordLinkage Package: Detecting Errors in Data"
(Available online in PDF at
https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf)
I saw something in the manual and the R Journal article about the identity
arguments for true match results, but I guess I only need those for reference
("gold standard") datasets. There is a non-missing value of bm (9) for my
example in both underlying datasets, so missing data is not why the result is
NA. What am I missing? How does one link two simple datasets using
compare.linkage?
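For comparison, a minimal sketch of a linkage call that does supply the true status through the identity1 and identity2 arguments of compare.linkage. It assumes that is_match is filled in only from these vectors and stays NA without them. Because this post does not establish whether identity.RLdata500 and identity.RLdata10000 share identifiers, the sketch splits RLdata500 into two halves instead. The split and the names set1, set2, id1, id2, and rpairs_known are hypothetical and serve only as illustration.

```r
library(RecordLinkage)
data(RLdata500)   # also loads identity.RLdata500

# Hypothetical split of one dataset into two "files" to be linked
set1 <- RLdata500[1:250, ]
set2 <- RLdata500[251:500, ]
id1  <- identity.RLdata500[1:250]
id2  <- identity.RLdata500[251:500]

# Same blocking and exclusions as in the post, plus the identity vectors
rpairs_known <- compare.linkage(set1, set2,
                                blockfld = 1, exclude = c(2:5, 7),
                                identity1 = id1, identity2 = id2)

# is_match should now be 0 or 1: 1 where id1 and id2 coincide for a pair
table(rpairs_known$pairs$is_match, useNA = "ifany")
summary(rpairs_known)
```

Under this assumption, a bm agreement of 1 alone would not set is_match to 1; the known status would come entirely from the identity vectors.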