[R] challenging data merging/joining problem
Rasmus Liland
jr@| @end|ng |rom po@teo@no
Mon Jul 6 00:11:48 CEST 2020
On 2020-07-05 14:50 -0400, Christopher W. Ryan wrote:
> I've been conducting relatively simple
> COVID-19 surveillance for our jurisdiction.
Dear Christopher,
As I am a bit unfamiliar when it comes to the
tidyverse, I wrote these lines using regular
data.frames:
### Convert to data.frame
dataSystemA <- as.data.frame(dataSystemA)
dataSystemB <- as.data.frame(dataSystemB)
### Add some unique columns to show how
# they are formatted later in this pipe.
dataSystemA$someIncompleteInfo <- 1:4
dataSystemB$other_incomplete_info <-
c("Yes", "No", "Perhaps", "Sometimes", "Yes")
### Add the dfs to a list, as perhaps the
# data kan be read somehow using
# something like
# sapply(c("A", "B"), read.from.somewhere)
dat <- list("A"=dataSystemA,
"B"=dataSystemB)
### Define a new dataSystem column in boths dfs
dat <- sapply(names(dat), function(n, dat) {
dat[[n]]$dataSystem <- n
return(list(dat[[n]]))
}, dat=dat)
### Read from a csv file column names
# where you have defined which ones
# are conceptually identical.
text <- "A,B
lastName,last_name
firstName,first_name
dob,birthdate
onsetDate,date_of_onset
symptomatic,symptoms_present"
conceptually.identical <- read.csv(text=text)
### Rename dataSystemA columns to the
# dataSystemB naming convention.
idx <- match(x=conceptually.identical$A,
table=colnames(dat$A))
colnames(dat$A)[idx] <-
conceptually.identical[idx,"B"]
### Find all column names, and fill the
# ones that does not exists in each
# df with NA, order the dfs by this
# vector, then rbind the dfs.
cn <- unique(unlist(lapply(dat, colnames)))
dat <- sapply(dat, function(x, cn) {
x[,cn[!(cn %in% colnames(x))]] <- NA
list(x[,cn])
}, cn=cn)
dat <- do.call(rbind, dat)
### Order unified df decreasingly by
# last_name and birthdate
dat <- dat[order(dat$last_name,
dat$birthdate, decreasing=FALSE),]
rownames(dat) <- NULL
dat
which yields
last_name first_name birthdate date_of_onset symptoms_present someIncompleteInfo dataSystem other_incomplete_info
1 DIGGORY cedric 2011-12-16 2020-07-12 TRUE NA B Yes
2 GRAINGER hermione 2010-12-05 2020-07-08 NA 3 A <NA>
3 GRAINGER hermione 2010-12-05 2020-07-08 TRUE NA B Yes
4 LONGBOTTOM neville 2011-01-24 2020-07-09 NA 4 A <NA>
5 LONGBOTTOM neville 2011-01-24 2020-07-09 TRUE NA B No
6 LOVEGOOD luna 2011-03-15 2020-07-11 FALSE NA B Sometimes
7 MALFOY draco 2011-07-04 2020-07-10 FALSE NA B Perhaps
8 POTTER harry 2010-12-16 2020-07-06 TRUE 1 A <NA>
9 WEASLEY ron 2010-12-30 2020-07-07 FALSE 2 A <NA>
When comparing the incomplete columns in each
data system, it might be useful to do some
reshaping like this:
cols <- c("last_name", "birthdate", "dataSystem", "date_of_onset")
reshape(dat[,cols],
idvar=c("last_name", "birthdate"),
timevar="dataSystem",
direction="wide")
which yields
last_name birthdate date_of_onset.B date_of_onset.A
1 DIGGORY 2011-03-17 2020-07-13 <NA>
2 GRAINGER 2010-12-06 2020-07-09 2020-07-09
4 LONGBOTTOM 2011-01-25 2020-07-10 2020-07-10
6 LOVEGOOD 2010-10-15 2020-07-12 <NA>
7 MALFOY 2010-12-25 2020-07-11 <NA>
8 POTTER 2011-05-09 <NA> 2020-07-07
9 WEASLEY 2012-04-05 <NA> 2020-07-08
Best,
Rasmus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20200706/a49f40f1/attachment.sig>
More information about the R-help
mailing list