DQAstats2SHIPdataset

The intention of this vignette is to provide a tutorial, how to prepare the metadata information of a dataset in order to use the DQAstats R package for data quality assessment.

The R package dataquieR provides both a dataset and corresponding metadata information. For demonstration purposes, the SHIP-dataset from dataquieR will be analyzed with DQAstats in the following. The necessary steps to convert dataquieR’s metadata representation into the format understood by DQAstats are provided along with some background information.

Load Data from dataquieR R package

First of all, load the data files from the dataquieR R package. These files include the actual dataset to be analyzed ship.RDS and the corresponding metadata representation ship_meta.RDS. For further information on dataquieR, please refer to https://dataquality.ship-med.uni-greifswald.de/.

# the next lines are commentend, since we avoided to have 'dataquieR' as yet
# another dependency to our package; however, if you have installed it, you
# can uncomment these lines to load the required files from the dataquieR-package

#list.files(system.file("extdata", package = "dataquieR"))

# load the ship dataset
# ship_data <- readRDS(
#   system.file("extdata/ship.RDS", package = "dataquieR")
# ) %>%
#   data.table::data.table()
ship_fn <- file.path(tempdir(), "ship.RDS")
ship_url <- paste0(
  "https://raw.githubusercontent.com/cran/dataquieR/",
  "master/inst/extdata/ship.RDS"
)
download.file(
  url = ship_url,
  destfile = ship_fn
)
Sys.sleep(1)
ship_data <- readRDS(ship_fn) %>%
  data.table::data.table()

# export the dataset to a CSV-file (one of the formats supported by DQAstats)
ship_data_export_fn <- "ship_data.csv"
data.table::fwrite(
  x = ship_data,
  file = file.path(tempdir(), ship_data_export_fn)
)
# load the ship metadata
# ship_meta <- readRDS(
#   system.file("extdata/ship_meta.RDS", package = "dataquieR")
# ) %>%
#   data.table::data.table()
ship_meta_fn <- file.path(tempdir(), "ship_meta.RDS")
ship_meta_url <- paste0(
  "https://raw.githubusercontent.com/cran/dataquieR/",
  "master/inst/extdata/ship_meta.RDS"
)
download.file(
  url = ship_meta_url,
  destfile = ship_meta_fn
)
Sys.sleep(1)
ship_meta <- readRDS(ship_meta_fn) %>%
  data.table::data.table()

Have a look at dataquieR’s metadata representation:

ship_meta %>%
  DT::datatable(options = list(
    scrollX = TRUE,
    pageLength = 4
  ))

Prepare DQAstats MDR

The most important part for using DQAstats for data quality assessment is to represent the metadata of the dataset(s) to be analyzed in the correct format. The MDR format expected by DQAstats is described in the DQAstats-wiki.

An example MDR is also provided with the DQAstats R package here. Furthermore, DQAstats comes with an exemplary toy dataset, which can be analyzed by executing the commands from the README.md.

Read empty DQAstats MDR as template

In order to get an empty template of the DQAstats’ MDR to be filled with the metadata information from dataquieR, we will just read the columns names from the example MDR:

mdr <- data.table::fread(
  file = system.file(
    "demo_data/utilities/MDR/mdr_example_data.csv",
    package = "DQAstats"
  ),
  header = TRUE,
  nrows = 0,
  colClasses = "character"
)
dim(mdr)
#> [1]  0 15
colnames(mdr)
#>  [1] "designation"           "source_variable_name"  "filter"               
#>  [4] "source_table_name"     "source_system_name"    "source_system_type"   
#>  [7] "key"                   "variable_name"         "variable_type"        
#> [10] "restricting_date_var"  "constraints"           "dqa_assessment"       
#> [13] "definition"            "data_map"              "plausibility_relation"

For a detailed description of the MDR fields, please refer to the DQAstats-wiki.

Transform SHIP-MDR to DQAstats representation

Next, we need to fill the rows of the DQAstats MDR. Since we are only analyzing one dataset (opposed to comparing datasets from two different databases, whereas the dataset was transferred from one of the databases to the other, e.g., via an extract-transfrom-load (ETL) process), we can set the MDR files “designation”, “variable_name”, and “key” to the same values.

# get names of dataelements from ship dataset
ship_var_names <- colnames(ship_data)

# loop over dataelements
for (var in ship_var_names) {
  # get variable type from ship metadata
  data_type <- ship_meta[get("VAR_NAMES") == var, get("DATA_TYPE")]
  # get variable description from ship metadata
  labels <- ship_meta[get("VAR_NAMES") == var, get("VALUE_LABELS")]
  split_list <- strsplit(
    x = labels,
    split = " | ",
    fixed = TRUE
  )
  # convert variable description into human readable format
  if (length(split_list[[1]]) == 1 && is.na(split_list[[1]])) {
    descr <- paste0("Description for dataelement '", var, "'")
  } else {
    descr <- paste0(
      "Description for dataelement '", var,
      "': \n\n", paste0(unlist(split_list), collapse = "; ")
    )
  }
  
  # for this dataelement, fill row of DQAstats-MDR with extracted information
  mdr <- data.table::rbindlist(
    l = list(
      mdr,
      data.table::as.data.table(
        x = cbind(
          "designation" = var,
          "definition" = descr,
          "source_variable_name" = var,
          "variable_name" = var,
          "key" = var,
          "source_table_name" = ship_data_export_fn,
          "source_system_type" = "csv", # indicates that datasets format is CSV
          "source_system_name" = "ship", # arbitrary name for this dataset
          "dqa_assessment" = 1, # default value for dataelements to be analyzed
          "variable_type" = data_type
        )
      )
    ),
    fill = TRUE
  )
}

Change Variable Type for Categorical Variables

In dataquieR, categorical variables are defined as “DATA_TYPE=integer” in the MDR. To work well with DQAstats, this needs to be changed to “enumerated” (see also the DQAstats-wiki for further details).

# save all categorical variablees in a vector
cat_vars <- c(
  "id", "sex", "obs_bp", "dev_bp", "obs_soma", "dev_length", "dev_weight",
  "obs_int", "school", "family", "smoking", "stroke", "myocard",
  "diab_known", "contraception", "income"
)

# change variable type from integer to "enumerated" (to get meaningful
# results from DQAstats)
mdr[get("designation") %in% cat_vars, ("variable_type") := "enumerated"]

Define Constraints

Next, the formatting of the constraints defined in dataquieR’s MDR needs to be transformed to the formatting, required by DQAstats. Please find further details on the definition of constraints in the DQAstats-wiki.

Constraints for Categorical Variables

# loop over categorical variables
for (var in cat_vars) {
  # get definitions of category-levels
  labels <- ship_meta[get("VAR_NAMES") == var, get("VALUE_LABELS")]
  split_list <- strsplit(
    x = labels,
    split = " | ",
    fixed = TRUE
  )
  
  # extract the allowed values for each categorical dataelement
  allowed_values <- strsplit(
    x = split_list[[1]],
    split = " =",
    fixed = TRUE
  )
  
  # transform allowed values into a "value_set" array, which is stored
  # as a JSON object (required by DQAstats)
  c <- jsonlite::toJSON(
    list(
      "value_set" = sapply(
          X = allowed_values,
          FUN = function(x) {
            return(x[[1]])
          })
    ),
    pretty = TRUE
  )
  # add constraints for this dataelement to the DQAstats MDR
  mdr[get("source_variable_name") == var, ("constraints") := c]
}
# for comparison, show metadata definition from dataquieR ...
# dataquieR: string, split by pipes '|'
ship_meta[get("VAR_NAMES") == "smoking", get("VALUE_LABELS")]
#> [1] "0 = nonsmoker | 1 = former smoker | 2 = smoker"
# ... and DQAstats: JSON
mdr[get("source_variable_name") == "smoking", get("constraints")]
#> [1] "{\n  \"value_set\": [\"0\", \"1\", \"2\"]\n}"
mdr[get("source_variable_name") == "smoking", get("constraints")] %>%
  jsonlite::fromJSON()
#> $value_set
#> [1] "0" "1" "2"

We can see here that the mapping from the data codes to the respective values gets lost during the transformation of constraints from dataquieR to DQAstats. This is because DQAstats is meant to be applied directly to the raw dataset and checks the conforming of the observed data values to the respective constraints for each variable.

Furthermore, it is to be noted that DQAstats does not require constraints to be defined for each data element in order to be applied to a dataset. Thus, data quality assessment with DQAstats can also be performed without the definition of constraints.

Constraints for Continuous Variables

# get names of continuous variables (float/integer)
cont_vars <- mdr[
  get("variable_type") %in% c("integer", "float"),
  unique(get("source_variable_name"))
]

# loop over continuous dataelements
for (var in cont_vars) {
  # extract constraints of continuous dataelement
  labels <- ship_meta[get("VAR_NAMES") == var, get("HARD_LIMITS")]
  split_list <- strsplit(
    x = labels,
    split = ";",
    fixed = TRUE
  )

  # transform these constraints to a list, with entries "min" and "max"
  constraints <- lapply(
    X = split_list,
    FUN = function(x) {

      if (!is.na(x[[1]])) {
        return(
          list(
            "min" = as.numeric(gsub("[[:punct:]]", "", x[[1]])),
            "max" = as.numeric(gsub("[[:punct:]]", "", x[[2]]))
          )
        )
      } else {
        return("error")
      }
    }
  )

  # save list as nested JSON within the key "range" (required by DQAstats)
  if (length(constraints[[1]]) > 1) {
    c <- jsonlite::toJSON(
      list("range" = constraints[[1]]),
      pretty = TRUE,
      auto_unbox = TRUE
    )
    # write constraint for this dataelement to DQAstats MDR
    mdr[get("source_variable_name") == var, ("constraints") := c]
  }
}

Constraints for String Variables

When setting “id” as variable_type="string", we are able to test for constraints using regular expressions. This is especially meaningful, when investigating a categorical variable with a lot of categories. Here, a simple regular expression is used, however, they can get arbitrarily complex.

# The `id` variable should actually not be treated as a categorical variable,
# since the number of potential categories is (theoretically) unlimited and
# an analysis of the categories in this case is seldom meaningful.
# Nevertheless, we could perform quality checks on the `id` variable, e.g. to
# check for a required formatting of the values.
# This can be achieved with DQAstats by changing the variable type to "string"
# and use a regular expression that checks for the expected format.

# change variable type to "string"
mdr[get("source_variable_name") == "id", ("variable_type") := "string"]

# add regular expression for the dataelement "id"
mdr[get("source_variable_name") == "id", ("constraints") := jsonlite::toJSON(
  list("regex" = "^[[:digit:]]{1,5}$"),
  pretty = TRUE,
  auto_unbox = TRUE
)]

Add Plausibility Checks

Currently, there is no mechanism to automatically transform the plausibility checks (or “contradictions” as named by dataquieR) from the dataquieR format to DQAstats. The following checks were manually taken from the file ship_contradiction_checks.csv and transformed into the representation required by DQAstats.

Please find further details on the definition of plausibility checks in the DQAstats-wiki.

Contraception in males (atemporal plausibility)

# add plausibilities
# sex and contraception
p <- jsonlite::toJSON(list(
  "atemporal" = list(
    "contraception" = list(
      "name" = "A_present_and_B_vv",
      "description" = "Contracept in males.",
      "join_crit" = "id",
      "filter" = list(
        "ship" = "^1$"
      ),
      "constraints" = list(
        "value_set" = list(
          "ship" = "1"
        )
      )
    )
  )
),
  pretty = TRUE,
  auto_unbox = TRUE
)

# write plausibility to mdr
mdr[
  source_variable_name == "sex" &
    dqa_assessment == 1,
  plausibility_relation := p
]

Diabetes age but no Diabetes (atemporal plausibility)

p <- jsonlite::toJSON(list(
  "atemporal" = list(
    "diab_age" = list(
      "name" = "A_present_and_B_levels_vl",
      "description" = "Diab age but no diab.",
      "join_crit" = "id",
      "filter" = list(
        "ship" = "^(?!.*(99900|99901|99801)).*$"
      ),
      "constraints" = list(
        "value_set" = list(
          "ship" = "1"
        )
      )
    )
  )),
  pretty = TRUE,
  auto_unbox = TRUE
)
mdr[
  source_variable_name == "diab_known" &
    dqa_assessment == 1,
  plausibility_relation := p
]

Every ID is associated with one Sex-value (uniqueness plausibility)

p <- jsonlite::toJSON(list(
  "uniqueness" = list(
    "sex" = list(
      "name" =  "Every ID is associated with one Sex",
      "description" = paste0("With each distinct value of 'id', ",
      "only one value of 'sex' may be associated."
    )
  ))),
  pretty = TRUE,
  auto_unbox = TRUE
)
mdr[
  source_variable_name == "id" &
    dqa_assessment == 1,
  plausibility_relation := p
]

Display the prepared MDR

Now, have a look at the MDR prepared for DQAstats:

mdr %>%
  DT::datatable(options = list(
    scrollX = TRUE,
    pageLength = 4
  ))

Create utils-folder

DQAstats requires a so-called utilities-folder that contains at least two sub-directories, named MDR and RMD. Please find further details on this folder in the DQAstats-wiki.

When having created this folder structure, the MDR prepared for DQAstats can be saved as ‘CSV’-file to the MDR-sub-directory. This file is then used as input for the function DQAstats::dqa().

utils_path <- file.path(tempdir(), "utilities")
dir.create(file.path(utils_path, "MDR"), recursive = TRUE)

data.table::fwrite(
  x = mdr,
  file = file.path(utils_path, "MDR/mdr.csv")
)

The RMD sub-directory of the utilities-folder contains the template files for the Rmarkdown report. For demonstration purposes, the templates provided with DQAstats can be used directly.

file.copy(
  from = system.file("demo_data/utilities/RMD", package = "DQAstats"),
  to = utils_path,
  overwrite = TRUE,
  recursive = TRUE
)

Run DQAstats::dqa()

Since we are here not aiming to compare two datasets, we are going to set the target_system_name to source_system_name. Thus we will get the analysis of the database compared with itself.

Furthermore, the path to the utilities-folder, the logging directory, the MDR-filename, as well as the directory to store the resulting PDF-report (output_dir) need to be defined and provided to the function DQAstats::dqa().

Additionally, the directory that contains the dataset(s) to be analyzed must be specified using an environment variable. This variable is composed of the dataset-name (as specified in the mdr.csv field source_table_name) and the suffix _PATH, here SHIP_PATH.

# ship data set
source_system_name <- "ship"
target_system_name <- source_system_name

mdr_filename <- "mdr.csv"
output_dir <- file.path(tempdir(), "output")
logfile_dir = tempdir()

# does only work, if "ship_data.csv" is lying next to this RMD-file
Sys.setenv("SHIP_PATH" = tempdir())

# provide all arguments to main function
all_results <- DQAstats::dqa(
  source_system_name = source_system_name,
  target_system_name = target_system_name,
  utils_path = utils_path,
  mdr_filename = mdr_filename,
  output_dir = output_dir,
  logfile_dir = logfile_dir,
  parallel = FALSE
)

Launch DQAgui as GUI-frontend to DQAstats

When executing the following command, the GUI will be launched at port 3838 by default and can be accessed from any web-browser via http://localhost:3838.

A tutorial for DQAgui can be found here.

library(DQAgui)
# set basepath for file-browser in GUI-config
Sys.setenv("CSV_SOURCE_BASEPATH" = tempdir())

## launch GUI
DQAgui::launch_app(
  utils_path = utils_path,
  mdr_filename = "mdr.csv",
  parallel = FALSE
)
# nolint end