[R] challenging data merging/joining problem

Mon Jul 6 00:11:48 CEST 2020

On 2020-07-05 14:50 -0400, Christopher W. Ryan wrote:
> I've been conducting relatively simple 
> COVID-19 surveillance for our jurisdiction. 

Dear Christopher,

As I am a bit unfamiliar when it comes to the 
tidyverse, I wrote these lines using regular 
data.frames:

	### Convert to data.frame
	dataSystemA <- as.data.frame(dataSystemA)
	dataSystemB <- as.data.frame(dataSystemB)

	### Add some unique columns to show how 
	#   they are formatted later in this pipe.
	dataSystemA$someIncompleteInfo <- 1:4
	dataSystemB$other_incomplete_info <-
	  c("Yes", "No", "Perhaps", "Sometimes", "Yes")

	### Add the dfs to a list, as perhaps the 
	#   data kan be read somehow using 
	#   something like
	#   sapply(c("A", "B"), read.from.somewhere)
	dat <- list("A"=dataSystemA,
	            "B"=dataSystemB)

	### Define a new dataSystem column in boths dfs
	dat <- sapply(names(dat), function(n, dat) {
	  dat[[n]]$dataSystem <- n
	  return(list(dat[[n]]))
	}, dat=dat)

	### Read from a csv file column names 
	#   where you have defined which ones 
	#   are conceptually identical.
	text <- "A,B
	lastName,last_name
	firstName,first_name
	dob,birthdate
	onsetDate,date_of_onset
	symptomatic,symptoms_present"
	conceptually.identical <- read.csv(text=text)

	### Rename dataSystemA columns to the 
	#   dataSystemB naming convention.
	idx <- match(x=conceptually.identical$A,
	             table=colnames(dat$A))
	colnames(dat$A)[idx] <-
	  conceptually.identical[idx,"B"]

	### Find all column names, and fill the
	#   ones that does not exists in each 
	#   df with NA, order the dfs by this 
	#   vector, then rbind the dfs.
	cn <- unique(unlist(lapply(dat, colnames)))
	dat <- sapply(dat, function(x, cn) {
	  x[,cn[!(cn %in% colnames(x))]] <- NA
	  list(x[,cn])
	}, cn=cn)
	dat <- do.call(rbind, dat)

	### Order unified df decreasingly by 
	#   last_name and birthdate
	dat <- dat[order(dat$last_name,
	  dat$birthdate, decreasing=FALSE),]
	rownames(dat) <- NULL

	dat

which yields

	   last_name first_name  birthdate date_of_onset symptoms_present someIncompleteInfo dataSystem other_incomplete_info
	1    DIGGORY     cedric 2011-12-16    2020-07-12             TRUE                 NA          B                   Yes
	2   GRAINGER   hermione 2010-12-05    2020-07-08               NA                  3          A                  <NA>
	3   GRAINGER   hermione 2010-12-05    2020-07-08             TRUE                 NA          B                   Yes
	4 LONGBOTTOM    neville 2011-01-24    2020-07-09               NA                  4          A                  <NA>
	5 LONGBOTTOM    neville 2011-01-24    2020-07-09             TRUE                 NA          B                    No
	6   LOVEGOOD       luna 2011-03-15    2020-07-11            FALSE                 NA          B             Sometimes
	7     MALFOY      draco 2011-07-04    2020-07-10            FALSE                 NA          B               Perhaps
	8     POTTER      harry 2010-12-16    2020-07-06             TRUE                  1          A                  <NA>
	9    WEASLEY        ron 2010-12-30    2020-07-07            FALSE                  2          A                  <NA>

When comparing the incomplete columns in each 
data system, it might be useful to do some 
reshaping like this:

	cols <- c("last_name", "birthdate", "dataSystem", "date_of_onset")
	reshape(dat[,cols],
	        idvar=c("last_name", "birthdate"),
	        timevar="dataSystem",
	        direction="wide")

which yields

	   last_name  birthdate date_of_onset.B date_of_onset.A
	1    DIGGORY 2011-03-17      2020-07-13            <NA>
	2   GRAINGER 2010-12-06      2020-07-09      2020-07-09
	4 LONGBOTTOM 2011-01-25      2020-07-10      2020-07-10
	6   LOVEGOOD 2010-10-15      2020-07-12            <NA>
	7     MALFOY 2010-12-25      2020-07-11            <NA>
	8     POTTER 2011-05-09            <NA>      2020-07-07
	9    WEASLEY 2012-04-05            <NA>      2020-07-08

Best,
Rasmus

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20200706/a49f40f1/attachment.sig>