Converting dates, times or date-times to ISO 8601

An SDTM DTC variable may include data that is represented in ISO 8601 format as a complete date/time, a partial date/time, or an incomplete date/time. {sdtm.oak} provides the create_iso8601() function that allows flexible mapping of date and time values in various formats to a single date-time ISO 8601 format.

Introduction

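To perform conversion to the ISO 8601 format you need to pass two key arguments: at least one character vector of dates, times or date-times, and a format, via the .format parameter, that tells create_iso8601() which date/time components to expect.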

create_iso8601("2000 01 05", .format = "y m d")
#> [1] "2000-01-05"
create_iso8601("22:35:05", .format = "H:M:S")
#> [1] "-----T22:35:05"

By default the .format parameter understands a few reserved characters: "y" for year, "m" for month, "d" for day, "H" for hours, "M" for minutes and "S" for seconds.

Besides character vectors of dates and times, you may also pass a single vector of date-times, provided you adjust the format:

create_iso8601("2000-01-05 22:35:05", .format = "y-m-d H:M:S")
#> [1] "2000-01-05T22:35:05"

Multiple inputs

If you have dates and times in separate vectors then you will need to pass a format for each vector:

create_iso8601("2000-01-05", "22:35:05", .format = c("y-m-d", "H:M:S"))
#> [1] "2000-01-05T22:35:05"

In addition, like most R functions that take vectors as input, create_iso8601() is vectorized:

date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07")
time <- c("00:12:21", "22:35:05", "03:00:15", "07:09:00")
create_iso8601(date, time, .format = c("y-m-d", "H:M:S"))
#> [1] "2000-01-05T00:12:21" "2001-12-25T22:35:05" "1980-06-18T03:00:15"
#> [4] "1979-09-07T07:09:00"

But the number of elements in each of the inputs has to match or you will get an error:

date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07")
time <- "00:12:21"
try(create_iso8601(date, time, .format = c("y-m-d", "H:M:S")))
#> Error in create_iso8601(date, time, .format = c("y-m-d", "H:M:S")) : 
#>   All vectors in `...` must be of the same length.
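
If a single time really does apply to every date, one option is to repeat it explicitly so that both inputs have the same length. The sketch below uses base R's rep() for that; it assumes {sdtm.oak} is installed and attached, and the variable name time_rep is only illustrative.

```r
library(sdtm.oak)

date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07")
time <- "00:12:21"

# Repeat the single time value once per date so that the input lengths match.
time_rep <- rep(time, times = length(date))

out <- create_iso8601(date, time_rep, .format = c("y-m-d", "H:M:S"))
out
```

Making the repetition explicit keeps the intent visible in the code, since create_iso8601() does not recycle shorter inputs for you.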

You can combine individual date and time components coming in as separate inputs; here is a contrived example of year, month and day together, hour, and minute:

year <- c("99", "84", "00", "80", "79", "1944", "1953")
month_and_day <- c("jan 1", "apr 04", "mar 06", "jun 18", "sep 07", "sep 13", "sep 14")
hour <- c("12", "13", "05", "23", "16", "16", "19")
min <- c("0", "60", "59", "42", "44", "10", "13")
create_iso8601(year, month_and_day, hour, min, .format = c("y", "m d", "H", "M"))
#> [1] "1999-01-01T12:00" "1984-04-04T13:60" "2000-03-06T05:59" "1980-06-18T23:42"
#> [5] "1979-09-07T16:44" "1944-09-13T16:10" "1953-09-14T19:13"

The .format argument must always be named; otherwise, its value is treated as one more input and .format itself is reported as missing.

try(create_iso8601("2000-01-05", "y-m-d"))
#> Error in create_iso8601("2000-01-05", "y-m-d") : 
#>   argument ".format" is missing, with no default

Format variations

The .format parameter can easily accommodate variations in the format of the inputs:

create_iso8601("2000-01-05", .format = "y-m-d")
#> [1] "2000-01-05"
create_iso8601("2000 01 05", .format = "y m d")
#> [1] "2000-01-05"
create_iso8601("2000/01/05", .format = "y/m/d")
#> [1] "2000-01-05"

Individual components may come in a different order, so adjust the format accordingly:

create_iso8601("2000 01 05", .format = "y m d")
#> [1] "2000-01-05"
create_iso8601("05 01 2000", .format = "d m y")
#> [1] "2000-01-05"
create_iso8601("01 05, 2000", .format = "m d, y")
#> [1] "2000-01-05"

All other characters given in the format are matched exactly as written, e.g. the number of spaces matters:

date <- c("2000 01 05", "2000  01 05", "2000 01  05", "2000   01   05")
create_iso8601(date, .format = "y m d")
#> [1] "2000-01-05" NA           NA           NA
create_iso8601(date, .format = "y  m d")
#> [1] NA           "2000-01-05" NA           NA
create_iso8601(date, .format = "y m  d")
#> [1] NA           NA           "2000-01-05" NA
create_iso8601(date, .format = "y   m   d")
#> [1] NA           NA           NA           "2000-01-05"

The format can include regular expressions though:

create_iso8601(date, .format = "y\\s+m\\s+d")
#> [1] "2000-01-05" "2000-01-05" "2000-01-05" "2000-01-05"

By default, a run of the same reserved character is treated as if only one had been provided, so these formats are equivalent:

date <- c("2000-01-05", "2001-12-25", "1980-06-18", "1979-09-07")
time <- c("00:12:21", "22:35:05", "03:00:15", "07:09:00")
create_iso8601(date, time, .format = c("y-m-d", "H:M:S"))
#> [1] "2000-01-05T00:12:21" "2001-12-25T22:35:05" "1980-06-18T03:00:15"
#> [4] "1979-09-07T07:09:00"
create_iso8601(date, time, .format = c("yyyy-mm-dd", "HH:MM:SS"))
#> [1] "2000-01-05T00:12:21" "2001-12-25T22:35:05" "1980-06-18T03:00:15"
#> [4] "1979-09-07T07:09:00"
create_iso8601(date, time, .format = c("yyyyyyyy-m-dddddd", "H:MMMMM:SSSS"))
#> [1] "2000-01-05T00:12:21" "2001-12-25T22:35:05" "1980-06-18T03:00:15"
#> [4] "1979-09-07T07:09:00"

Multiple alternative formats

When an input vector contains values with varying formats, a single format may not be adequate to encompass all variations. In such situations, it’s advisable to list multiple alternative formats. This approach ensures that each format is tried sequentially until one matches the data in the vector.

date <- c("2000/01/01", "2000-01-02", "2000 01 03", "2000/01/04")
create_iso8601(date, .format = "y-m-d")
#> [1] NA           "2000-01-02" NA           NA
create_iso8601(date, .format = "y m d")
#> [1] NA           NA           "2000-01-03" NA
create_iso8601(date, .format = "y/m/d")
#> [1] "2000-01-01" NA           NA           "2000-01-04"
create_iso8601(date, .format = list(c("y-m-d", "y m d", "y/m/d")))
#> [1] "2000-01-01" "2000-01-02" "2000-01-03" "2000-01-04"

Consider the order in which you supply the formats, as it can be significant. If multiple formats could potentially match, the sequence determines which format is applied first.

create_iso8601("07 04 2000", .format = list(c("d m y", "m d y")))
#> [1] "2000-04-07"
create_iso8601("07 04 2000", .format = list(c("m d y", "d m y")))
#> [1] "2000-07-04"

Note that if you are passing alternative formats, then the .format argument must be a list whose length matches the number of inputs.
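
As a sketch of that rule with two inputs, the list below has one element per input: two alternative formats for the dates and a single format for the times. The values are made up for illustration, and the code assumes {sdtm.oak} is installed and attached.

```r
library(sdtm.oak)

# Dates come in two different layouts; times share one layout.
date <- c("2000-01-05", "2001/12/25")
time <- c("00:12", "22:35")

# One list element per input: alternatives for `date`, a single format for `time`.
fmt <- list(c("y-m-d", "y/m/d"), "H:M")

out <- create_iso8601(date, time, .format = fmt)
out
```

Wrapping the formats in list() is what distinguishes "alternatives for one input" from "one format per input", which is how a plain character vector is read.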

Parsing of date or time components

By default, date or time components are parsed as follows:

# Years: two-digit or four-digit numbers.
years <- c("0", "1", "00", "01", "15", "30", "50", "68", "69", "80", "99")
create_iso8601(years, .format = "y")
#>  [1] NA     NA     "2000" "2001" "2015" "2030" "2050" "2068" "1969" "1980"
#> [11] "1999"

# Adjust the point where two-digit years are mapped to the 2000s or the 1900s.
create_iso8601(years, .format = "y", .cutoff_2000 = 20L)
#>  [1] NA     NA     "2000" "2001" "2015" "1930" "1950" "1968" "1969" "1980"
#> [11] "1999"

# Both numeric months (two-digit only) and abbreviated months work out of the box
months <- c("0", "00", "1", "01", "Jan", "jan")
create_iso8601(months, .format = "m")
#> [1] NA     "--00" NA     "--01" "--01" "--01"

# Month days: single or two-digit numbers, anything else results in NA.
create_iso8601(c("1", "01", "001", "10", "20", "31"), .format = "d")
#> [1] "----01" "----01" NA       "----10" "----20" "----31"

# Hours
create_iso8601(c("1", "01", "001", "10", "20", "31"), .format = "H")
#> [1] "-----T01" "-----T01" NA         "-----T10" "-----T20" "-----T31"

# Minutes
create_iso8601(c("1", "01", "001", "10", "20", "60"), .format = "M")
#> [1] "-----T-:01" "-----T-:01" NA           "-----T-:10" "-----T-:20"
#> [6] "-----T-:60"

# Seconds
create_iso8601(c("1", "01", "23.04", "001", "10", "20", "60"), .format = "S")
#> [1] "-----T-:-:01"    "-----T-:-:01"    "-----T-:-:23.04" NA               
#> [5] "-----T-:-:10"    "-----T-:-:20"    "-----T-:-:60"
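
To make the two-digit year rule from the examples above more concrete, here is a minimal base R sketch of the mapping. The helper expand_two_digit_year() is hypothetical and is not part of {sdtm.oak}. The default of 68 is inferred from the output above (68 maps to 2068, 69 to 1969), and the sketch assumes that a value equal to the cutoff goes to the 2000s.

```r
# Hypothetical helper, for illustration only; not the package's implementation.
expand_two_digit_year <- function(yy, cutoff_2000 = 68L) {
  yy <- as.integer(yy)
  # Assumption: values up to and including the cutoff belong to the 2000s,
  # everything above it to the 1900s.
  ifelse(yy <= cutoff_2000, 2000L + yy, 1900L + yy)
}

two_digit <- c("00", "15", "30", "68", "69", "99")

default_cutoff <- expand_two_digit_year(two_digit)
lower_cutoff <- expand_two_digit_year(two_digit, cutoff_2000 = 20L)

default_cutoff
lower_cutoff
```

Lowering the cutoff moves more two-digit years into the 1900s, which is the effect seen with .cutoff_2000 = 20L above.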

Allowing alternative date or time values

If date or time components include special values, e.g. codes for missing data, you can declare those values as acceptable alternatives so that the parsing tolerates them; use the .na argument:

create_iso8601("U DEC 2019 14:00", .format = "d m y H:M")
#> [1] NA
create_iso8601("U DEC 2019 14:00", .format = "d m y H:M", .na = "U")
#> [1] "2019-12--T14:00"

create_iso8601("U UNK 2019 14:00", .format = "d m y H:M")
#> [1] NA
create_iso8601("U UNK 2019 14:00", .format = "d m y H:M", .na = c("U", "UNK"))
#> [1] "2019----T14:00"

In this case you could achieve the same result using regexps:

create_iso8601("U UNK 2019 14:00", .format = "(d|U) (m|UNK) y H:M")
#> [1] "2019----T14:00"
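
In practice, special values usually appear alongside regular ones in the same vector. The following sketch passes such a mixed vector with .na = "U"; the input values are made up for illustration, and the code assumes {sdtm.oak} is installed and attached.

```r
library(sdtm.oak)

# One record has an unknown day coded as "U"; the other has a regular day.
dttm <- c("U DEC 2019 14:00", "05 DEC 2019 14:00")

# Declare "U" as an accepted stand-in for a missing component.
out <- create_iso8601(dttm, .format = "d m y H:M", .na = "U")
out
```

Declaring the special values once through .na keeps the format string itself readable, compared with spelling out the alternatives as a regular expression.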

Changing reserved format characters

Sometimes the reserved characters ("y", "m", "d", "H", "M", "S") get in the way of specifying an adequate format. For example, you might be tempted to use the format "HHMM" to parse a time such as "14H00M", assuming that the first "H" codes for the hour and the second "H" is a literal "H". In fact, "HH" is taken to mean hours and "MM" to mean minutes. You can use the function fmt_cmp() to specify alternative regexps for the format components, replacing the default characters.

In the next example, we assign new format strings to the hour and minute components, which frees the "H" and "M" characters from being interpreted as hours and minutes so that they are taken literally:

create_iso8601("14H00M", .format = "HHMM")
#> [1] NA
create_iso8601("14H00M", .format = "xHwM", .fmt_c = fmt_cmp(hour = "x", min = "w"))
#> [1] "-----T14:00"

Note that you need to make sure that the format component regexps are mutually exclusive, i.e. they don’t have overlapping matches; otherwise create_iso8601() will fail with an error. In the next example both months and minutes could be represented by an "m" in the format resulting in an ambiguous format specification.

fmt_cmp(hour = "h", min = "m")
#> $sec
#> [1] "S+"
#> 
#> $min
#> [1] "m"
#> 
#> $hour
#> [1] "h"
#> 
#> $mday
#> [1] "d+"
#> 
#> $mon
#> [1] "m+"
#> 
#> $year
#> [1] "y+"
#> 
#> attr(,"class")
#> [1] "fmt_c"
try(create_iso8601("14H00M", .format = "hHmM", .fmt_c = fmt_cmp(hour = "h", min = "m")))
#> Error in purrr::map2(dots, .format, ~parse_dttm(dttm = .x, fmt = .y, na = .na,  : 
#>   ℹ In index: 1.
#> Caused by error in `purrr::map()`:
#> ℹ In index: 1.
#> Caused by error in `parse_dttm_fmt()`:
#> ! Patterns in `fmt_c` have overlapping matches.