[R] [External] challenging data merging/joining problem

Mon Jul 6 11:03:43 CEST 2020

Hi Christopher,
This seems pretty standard and straightforward, unless I am missing
something. You can do the "full join" without changing variable names.
Here's a small code example with two tibbles, a and b, where the
column 'x' in a corresponds to the column 'u' in b.

a <- tibble(x=1:15,y=21:35)
b <- tibble(u=c(1:10,51:55),z=31:45)
foo <- merge(a,b,by.x="x",by.y="u",all.x=TRUE,all.y=TRUE)
foo

#     x  y  z
# 1   1 21 31
# 2   2 22 32
# 3   3 23 33
# 4   4 24 34
# 5   5 25 35
# 6   6 26 36
# 7   7 27 37
# 8   8 28 38
# 9   9 29 39
# 10 10 30 40
# 11 11 31 NA
# 12 12 32 NA
# 13 13 33 NA
# 14 14 34 NA
# 15 15 35 NA
# 16 51 NA 41
# 17 52 NA 42
# 18 53 NA 43
# 19 54 NA 44
# 20 55 NA 45

HTH,
Eric

On Mon, Jul 6, 2020 at 2:07 AM Richard M. Heiberger <rmh using temple.edu> wrote:
>
> Have you talked directly to the designers of the new database?
> One would hope that they had a clear migration path in mind.
> Perhaps they just didn't document it to your satisfaction.
>
> Rich
>
> On Sun, Jul 5, 2020 at 2:51 PM Christopher W. Ryan <cryan using binghamton.edu> wrote:
> >
> > I've been conducting relatively simple COVID-19 surveillance for our
> > jurisdiction. We get data on lab test results automatically, and then
> > interview patients to obtain other information, like clinical details.
> > We had been recording all data in our long-time data system (call it
> > dataSystemA). But as of a particular date, there was a major change in
> > the data system we were compelled to use. Call the new one dataSystemB.
> > dataSystemA and dataSystemB contain very similar information,
> > conceptually, but the variable names are all different, and there are
> > some variables in one that do not appear in the other. Total number of
> > variables in each is about 50-70.
> >
> > Furthermore, for about 2 weeks prior to the transition, lab test results
> > started being deposited into dataSystemB while dataSystemA was still
> > being used to record the full information from the interviews.
> > Subsequent to the transition, lab test results and interview information
> > are being recorded in dataSystemB, while the lab test results alone are
> > still being automatically deposited into dataSystemA.
> >
> > Diagrammatically:
> >
> > dataSystemA usage: ____________________ ............>>
> >
> > dataSystemB usage:               ......._____________>>
> >
> > where ________ represents full data and ..... represents partial data,
> > and >> represents the progress of time.
> >
> >
> > The following will create MWE of the data wrangling problem, with the
> > change in data systems made to occur overnight on 2020-07-07:
> >
> > library(dplyr)
> > dataSystemA <- tibble(lastName = c("POTTER", "WEASLEY", "GRAINGER",
> > "LONGBOTTOM"),
> >                       firstName = c("harry", "ron", "hermione", "neville"),
> >                       dob = as.Date(Sys.Date() + c(sample(-3650:-3000,
> > size = 2), -3500, -3450)),
> >                       onsetDate = as.Date(Sys.Date() + 1:4),
> >                       symptomatic = c(TRUE, FALSE, NA, NA) )
> > dataSystemB <- tibble(last_name = c("GRAINGER", "LONGBOTTOM", "MALFOY",
> > "LOVEGOOD", "DIGGORY"),
> >                       first_name = c("hermione", "neville", "draco",
> > "luna", "cedric"),
> >                       birthdate = as.Date(Sys.Date() + c(-3500, -3450,
> > sample(-3650:-3000, size = 3))),
> >                       date_of_onset = as.Date(Sys.Date() + 3:7),
> >                       symptoms_present = c(TRUE, TRUE, FALSE, FALSE, TRUE))
> >
> >
> >
> > Obviously, this is all the same public health problem, so I don't want a
> > big uninterpretable gap in my reports. I am looking for advice on the
> > best strategy for combining two different tibbles with some overlap in
> > observations (some patients appear in both data systems, with varying
> > degrees of completeness of data) and with some of the same things being
> > mesaured and recorded in the two data systems, but with different
> > variable names.
> >
> > I've thought of two different strategies, neither of which seems ideal
> > but either of which might work:
> >
> > 1. change the variable names in dataSystemB to match their
> > conceptually-identical variables in dataSystemA, and then use some
> > version of bind_rows()
> >
> > 2. Create a unique identifier from last names, first names, and dates of
> > birth, use some type of full_join(), matching on that identifier,
> > obtaining all columns from both tibbles, and then "collapse"
> > conceptually-identical variables like onsetDate and date_of_onset using
> > coalesce()
> >
> > Sorry for my long-windedness. Grateful for any advice.
> >
> > --Chris Ryan
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.