[R] pairing columns based on a value
Robert Strother
rstrothe at gmail.com
Wed Dec 17 21:14:22 CET 2014
I have a large dataset (~50,000 rows, 96 columns) of hospital
administrative data.
Many of the columns hold the clinical coding of inpatient events (using ICD-10).
A simplified example of the data is below:
> dput(dat_unmatched)
structure(list(ID = structure(c(4L, 3L, 2L, 1L), .Label = c("BCM3455",
"BZD2643", "GDR2343", "MCZ4325"), class = "factor"), X.1 = structure(c(2L,
3L, 1L, 1L), .Label = c("B83.2", "C23.2", "F56.23"), class = "factor"),
X.2 = structure(c(2L, 1L, 2L, 2L), .Label = c("M20.64", "T43.2"
), class = "factor"), X.3 = structure(c(2L, 3L, 3L, 1L), .Label =
c("F56.23",
"R23.1", "Y32.1"), class = "factor"), X.4 = structure(c(1L,
2L, 2L, 3L), .Label = c("M23.5", "T44.2", "Y32.1"), class = "factor"),
X.5 = structure(c(1L, 2L, 1L, 2L), .Label = c("", "Q23.6"
), class = "factor")), .Names = c("ID", "X.1", "X.2", "X.3",
"X.4", "X.5"), class = "data.frame", row.names = c(NA, -4L))
I am interested in the codes that start with a "T" or a "Y", and I want to
link each of them to the nearest preceding column in the same row whose code
does not begin with a "T" or "Y". I suspect I will need to use regular
expressions, and likely a loop, but I am really out of my depth at this point.
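
To make the rule concrete, here is a minimal sketch of the detection step on
a single row. It is an illustration only: `codes` is the first row of the
example data typed in by hand, and the names `is_external`, `is_base` and
`last_base` are made up for this sketch.

```r
# First row of the example data (ID MCZ4325), typed in by hand
codes <- c("C23.2", "T43.2", "R23.1", "M23.5", "")

# TRUE where a code starts with "T" or "Y"
is_external <- grepl("^[TY]", codes)

# TRUE where a code is non-empty and does not start with "T" or "Y"
is_base <- !is_external & nzchar(codes)

# For every position, the index of the most recent "base" code so far
# (0 means no base code has been seen yet)
last_base <- cummax(ifelse(is_base, seq_along(codes), 0L))

# The code each T/Y code should be linked to
linked_to <- codes[last_base[is_external]]
```

Here `linked_to` is "C23.2", the code that T43.2 should be paired with.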
I would like the final dataset to look like:
> dput(dat_matched)
structure(list(ID = structure(c(4L, 3L, 2L, 1L), .Label = c("BCM3455",
"BZD2643", "GDR2343", "MCZ4325"), class = "factor"), X.1 = structure(c(2L,
3L, 1L, 1L), .Label = c("B83.2", "C23.2", "M20.64"), class = "factor"),
X.2 = structure(c(1L, 2L, 1L, 1L), .Label = c("T43.2", "Y32.1"
), class = "factor"), X.3 = structure(c(1L, 4L, 2L, 3L), .Label = c("",
"B83.2", "F56.23", "M20.64"), class = "factor"), X.4 = structure(c(1L,
2L, 3L, 3L), .Label = c("", "T44.2", "Y32.1"), class = "factor"),
X.5 = structure(c(1L, 1L, 2L, 1L), .Label = c("", "B83.2"
), class = "factor"), X = structure(c(1L, 1L, 2L, 1L), .Label = c("",
"T44.2"), class = "factor")), .Names = c("ID", "X.1", "X.2",
"X.3", "X.4", "X.5", "X"), class = "data.frame", row.names = c(NA,
-4L))
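
One possible way to get from the first structure to the second is sketched
below. It is a rough sketch under assumptions, not a tested solution for the
full 96-column dataset: the function name `pair_codes` and its `id_col`
argument are hypothetical, the example data is rebuilt by hand from the
dput() output, every column other than the ID is assumed to hold codes, and
a T/Y code with no preceding non-T/Y code in its row is simply dropped.

```r
# Example data rebuilt by hand from dput(dat_unmatched)
dat_unmatched <- data.frame(
  ID  = c("MCZ4325", "GDR2343", "BZD2643", "BCM3455"),
  X.1 = c("C23.2", "F56.23", "B83.2", "B83.2"),
  X.2 = c("T43.2", "M20.64", "T43.2", "T43.2"),
  X.3 = c("R23.1", "Y32.1", "Y32.1", "F56.23"),
  X.4 = c("M23.5", "T44.2", "T44.2", "Y32.1"),
  X.5 = c("", "Q23.6", "", "Q23.6"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: pair each T/Y code with the nearest preceding
# non-T/Y code in the same row, and lay the pairs out side by side.
pair_codes <- function(dat, id_col = "ID") {
  code_cols <- setdiff(names(dat), id_col)
  m <- as.matrix(dat[code_cols])  # factors become character strings

  pairs <- lapply(seq_len(nrow(m)), function(i) {
    codes <- unname(m[i, ])
    ok    <- !is.na(codes) & nzchar(codes)   # ignore empty or missing cells
    ext   <- ok & grepl("^[TY]", codes)      # codes starting with T or Y
    base  <- ok & !ext                       # all other codes
    # index of the most recent base code at each position (0 = none yet)
    last_base <- cummax(ifelse(base, seq_along(codes), 0L))
    hit <- which(ext & last_base > 0)
    # interleave: base, T/Y, base, T/Y, ...
    as.vector(rbind(codes[last_base[hit]], codes[hit]))
  })

  # Pad every row with "" up to the widest row
  width <- max(0L, vapply(pairs, length, integer(1)))
  out <- matrix("", nrow = nrow(m), ncol = width)
  for (i in seq_along(pairs)) {
    out[i, seq_along(pairs[[i]])] <- pairs[[i]]
  }
  colnames(out) <- paste0("X.", seq_len(width))
  data.frame(dat[id_col], out, stringsAsFactors = FALSE)
}

dat_matched <- pair_codes(dat_unmatched)
```

On the four example rows this gives the same pairs as the desired output,
except that the columns are plain character vectors rather than factors and
the last column is named X.6 instead of X.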
Any help appreciated.
Matthew