[R] challenging data merging/joining problem

Christopher W. Ryan cry@n @end|ng |rom b|ngh@mton@edu
Sun Jul 5 20:50:51 CEST 2020


I've been conducting relatively simple COVID-19 surveillance for our
jurisdiction. We get data on lab test results automatically, and then
interview patients to obtain other information, like clinical details.
We had been recording all data in our long-time data system (call it
dataSystemA). But as of a particular date, there was a major change in
the data system we were compelled to use. Call the new one dataSystemB.
dataSystemA and dataSystemB contain very similar information,
conceptually, but the variable names are all different, and there are
some variables in one that do not appear in the other. Total number of
variables in each is about 50-70.

Furthermore, for about 2 weeks prior to the transition, lab test results
started being deposited into dataSystemB while dataSystemA was still
being used to record the full information from the interviews.
Subsequent to the transition, lab test results and interview information
are being recorded in dataSystemB, while the lab test results alone are
still being automatically deposited into dataSystemA.

Diagrammatically:

dataSystemA usage: ____________________ ............>>

dataSystemB usage:               ......._____________>>

where ________ represents full data and ..... represents partial data,
and >> represents the progress of time.


The following will create MWE of the data wrangling problem, with the
change in data systems made to occur overnight on 2020-07-07:

library(dplyr)
dataSystemA <- tibble(lastName = c("POTTER", "WEASLEY", "GRAINGER",
"LONGBOTTOM"),
                      firstName = c("harry", "ron", "hermione", "neville"),
                      dob = as.Date(Sys.Date() + c(sample(-3650:-3000,
size = 2), -3500, -3450)),
                      onsetDate = as.Date(Sys.Date() + 1:4),
                      symptomatic = c(TRUE, FALSE, NA, NA) )
dataSystemB <- tibble(last_name = c("GRAINGER", "LONGBOTTOM", "MALFOY",
"LOVEGOOD", "DIGGORY"),
                      first_name = c("hermione", "neville", "draco",
"luna", "cedric"),
                      birthdate = as.Date(Sys.Date() + c(-3500, -3450,
sample(-3650:-3000, size = 3))),
                      date_of_onset = as.Date(Sys.Date() + 3:7),
                      symptoms_present = c(TRUE, TRUE, FALSE, FALSE, TRUE))



Obviously, this is all the same public health problem, so I don't want a
big uninterpretable gap in my reports. I am looking for advice on the
best strategy for combining two different tibbles with some overlap in
observations (some patients appear in both data systems, with varying
degrees of completeness of data) and with some of the same things being
mesaured and recorded in the two data systems, but with different
variable names.

I've thought of two different strategies, neither of which seems ideal
but either of which might work:

1. change the variable names in dataSystemB to match their
conceptually-identical variables in dataSystemA, and then use some
version of bind_rows()

2. Create a unique identifier from last names, first names, and dates of
birth, use some type of full_join(), matching on that identifier,
obtaining all columns from both tibbles, and then "collapse"
conceptually-identical variables like onsetDate and date_of_onset using
coalesce()

Sorry for my long-windedness. Grateful for any advice.

--Chris Ryan



More information about the R-help mailing list