[R] [External] challenging data merging/joining problem

Mon Jul 6 01:03:05 CEST 2020

Have you talked directly to the designers of the new database?
One would hope that they had a clear migration path in mind.
Perhaps they just didn't document it to your satisfaction.

Rich

On Sun, Jul 5, 2020 at 2:51 PM Christopher W. Ryan <cryan using binghamton.edu> wrote:
>
> I've been conducting relatively simple COVID-19 surveillance for our
> jurisdiction. We get data on lab test results automatically, and then
> interview patients to obtain other information, like clinical details.
> We had been recording all data in our long-time data system (call it
> dataSystemA). But as of a particular date, there was a major change in
> the data system we were compelled to use. Call the new one dataSystemB.
> dataSystemA and dataSystemB contain very similar information,
> conceptually, but the variable names are all different, and there are
> some variables in one that do not appear in the other. Total number of
> variables in each is about 50-70.
>
> Furthermore, for about 2 weeks prior to the transition, lab test results
> started being deposited into dataSystemB while dataSystemA was still
> being used to record the full information from the interviews.
> Subsequent to the transition, lab test results and interview information
> are being recorded in dataSystemB, while the lab test results alone are
> still being automatically deposited into dataSystemA.
>
> Diagrammatically:
>
> dataSystemA usage: ____________________ ............>>
>
> dataSystemB usage:               ......._____________>>
>
> where ________ represents full data and ..... represents partial data,
> and >> represents the progress of time.
>
>
> The following will create MWE of the data wrangling problem, with the
> change in data systems made to occur overnight on 2020-07-07:
>
> library(dplyr)
> dataSystemA <- tibble(lastName = c("POTTER", "WEASLEY", "GRAINGER",
> "LONGBOTTOM"),
>                       firstName = c("harry", "ron", "hermione", "neville"),
>                       dob = as.Date(Sys.Date() + c(sample(-3650:-3000,
> size = 2), -3500, -3450)),
>                       onsetDate = as.Date(Sys.Date() + 1:4),
>                       symptomatic = c(TRUE, FALSE, NA, NA) )
> dataSystemB <- tibble(last_name = c("GRAINGER", "LONGBOTTOM", "MALFOY",
> "LOVEGOOD", "DIGGORY"),
>                       first_name = c("hermione", "neville", "draco",
> "luna", "cedric"),
>                       birthdate = as.Date(Sys.Date() + c(-3500, -3450,
> sample(-3650:-3000, size = 3))),
>                       date_of_onset = as.Date(Sys.Date() + 3:7),
>                       symptoms_present = c(TRUE, TRUE, FALSE, FALSE, TRUE))
>
>
>
> Obviously, this is all the same public health problem, so I don't want a
> big uninterpretable gap in my reports. I am looking for advice on the
> best strategy for combining two different tibbles with some overlap in
> observations (some patients appear in both data systems, with varying
> degrees of completeness of data) and with some of the same things being
> mesaured and recorded in the two data systems, but with different
> variable names.
>
> I've thought of two different strategies, neither of which seems ideal
> but either of which might work:
>
> 1. change the variable names in dataSystemB to match their
> conceptually-identical variables in dataSystemA, and then use some
> version of bind_rows()
>
> 2. Create a unique identifier from last names, first names, and dates of
> birth, use some type of full_join(), matching on that identifier,
> obtaining all columns from both tibbles, and then "collapse"
> conceptually-identical variables like onsetDate and date_of_onset using
> coalesce()
>
> Sorry for my long-windedness. Grateful for any advice.
>
> --Chris Ryan
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.