Model verification - keys, constraints and normalization

2020-07-02

In this document we will present several specialized functions for conducting basic tests about key conditions and about relations between tables. We will also describe functions that can be used for splitting and uniting tables.

library(tidyverse)
library(dm)

Testing key constraints

This section contains information and examples about the following functions:

  1. check_key(.data, ...)
  2. check_subset(t1, c1, t2, c2)
  3. check_set_equality(t1, c1, t2, c2)

When you have tables (data frames) that are connected by key relations, this package can help you to verify the assumed key relations and/or determine the existing key relations between the tables. For example, if you have tables:

data_1 <- tibble(a = c(1, 2, 1), b = c(1, 4, 1), c = c(5, 6, 7))
data_2 <- tibble(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

and you want to know if a is a primary key for data_1, you can use the check_key() function:

check_key(data_1, a)
#> Error: (`a`) not a unique key of `data_1`.

Mind the error message when a test is not passed.

For data_2, column a is a key:

check_key(data_2, a)

To see if a column of one table contains only those values that are also present in another column of another table, the check_subset() function can be used:

check_subset(data_1, a, data_2, a)

This function is important for determining if a column is a foreign key to some other table. What about the inverse relation?

check_subset(data_2, a, data_1, a)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     3     6     9
#> Error: Column `a` of table `data_2` contains values (see above) that are not present in column `a` of table `data_1`.

It should be kept in mind that check_subset() does not test if column c2 is a unique key of table t2. In order to find out if a (child) table t1 contains a column c1 that is a foreign key to a (parent) table t2 with the corresponding column c2, the following method should be used:

check_key(t2, c2)
check_subset(t1, c1, t2, c2)

To check both directions at once, and to find out if the unique values of c_1 in t_1 are the same as those of c_2 in t_2, {dm} provides the function check_set_equality():

check_set_equality(data_1, a, data_2, a)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     3     6     9
#> Error: Column `a` of table `data_2` contains values (see above) that are not present in column `a` of table `data_1`..

Introducing one more table enables us to show how it looks when the test is passed:

data_3 <- tibble(a = c(2, 1, 2), b = c(4, 5, 6), c = c(7, 8, 9))

check_set_equality(data_1, a, data_3, a)

If the test is passed, the return value of the function will be the first table parameter (invisibly). This ensures that the functions can be conveniently used in a pipe configuration.

Testing cardinalities between two tables

This section contains information and examples for the functions

  1. check_cardinality_0_n(parent_table, primary_key_column, child_table, foreign_key_column)
  2. check_cardinality_1_n(parent_table, primary_key_column, child_table, foreign_key_column)
  3. check_cardinality_0_1(parent_table, primary_key_column, child_table, foreign_key_column)
  4. check_cardinality_1_1(parent_table, primary_key_column, child_table, foreign_key_column)
  5. examine_cardinality(parent_table, primary_key_column, child_table, foreign_key_column)

The four functions for testing for a specific kind of cardinality of the relation all require a parent table and a child table as inputs. The functions first test if that requirement is fulfilled by checking if:

  1. primary_key_column is a unique key for parent_table
  2. The set of values of foreign_key_column is a subset of the set of values of primary_key_column

The cardinality specifications 0_n, 1_n, 0_1, 1_1 refer to the expected relation that the child table has with the parent table. The numbers ‘0’, ‘1’ and ‘n’ refer to the number of values in the child table’s column (foreign_key_column) that correspond to each value of the parent table’s column (primary_key_column). ‘n’ means more than one in this context, with no upper limit.

0_n means, that for each value of the parent_key_column, the number of corresponding records in the child table is unrestricted. 1_n means, that for each value of the parent_key_column there is at least one corresponding record in the child table. This means that there is a “surjective” relation from the child table to the parent table w.r.t. the specified columns, i.e. for each parent table column value there exists at least one equal child table column value.

0_1 means, that for each value of the parent_key_column, at least zero and at most one value has to correspond to it in the column of the child table. This means that there is an “injective” relation from the child table to the parent table w.r.t. the specified columns, i.e. no parent table column value is addressed multiple times. But not all of the parent table column values have to be referred to.

1_1 means, that for each value of the parent_key_column, exactly one value has to correspond to it in the child table’s column. This means that there is a “bijective” (“injective” AND “surjective”) relation between the child table and the parent table w.r.t. the specified columns, i.e. the set of values of the two columns is equal and there are no duplicates in either of them.

Also examine_cardinality() first performs the above mentioned tests to figure out, if the parent-child table relationship criteria are met. Subsequently, two further checks are made to determine the nature of the relation (surjective, injective, bijective or none of these) between the two columns.

Examples

Given the following three data frames:

d1 <- tibble(a = 1:5)
d2 <- tibble(c = c(1:5,5))
d3 <- tibble(c = 1:4)
d4 <- tibble(a = c(2:5, 5))

Here are some examples of how the cardinality testing functions can be used:

# This does not pass, `c` is not unique key of d2:
check_cardinality_0_n(d2, c, d1, a)
#> Error: (`c`) not a unique key of `d2`.

# This passes, multiple values in d2$c are allowed:
check_cardinality_0_n(d1, a, d2, c)

# This does not pass, injectivity is violated:
check_cardinality_1_1(d1, a, d2, c)
#> Error: 1..1 cardinality (bijectivity) is not given: Column `c` in table `d2` contains duplicate values.

# This passes:
check_cardinality_0_1(d1, a, d3, c)

examine_cardinality() returns the type of relation, e.g.:

examine_cardinality(d1, a, d3, c)
#> [1] "injective mapping (child: 0 or 1 -> parent: 1)"
examine_cardinality(d1, a, d2, c)
#> [1] "surjective mapping (child: 1 to n -> parent: 1)"
examine_cardinality(d1, a, d1, a)
#> [1] "bijective mapping (child: 1 -> parent: 1)"
examine_cardinality(d1, a, d4, a)
#> [1] "generic mapping (child: 0 to n -> parent: 1)"

Table surgery

The relevant functions are:

  1. decompose_table(.data, new_id_column, ...)
  2. reunite_parent_child(child_table, parent_table, id_column)
  3. reunite_parent_child_from_list(list_of_parent_child_tables, id_column)

The first function implements table normalization. An existing table is split into a parent table (i.e. a lookup table) and a child table (containing the observations), linked by a key column (here: new_id_column). Basically, a foreign key relation would be created, pointing from the new_id_column of the child table to the parent table’s corresponding column, which can be seen as the parent table’s primary key column. The function decompose_table() does that, as can be seen in the following example:

mtcars_tibble <- as_tibble(mtcars)
mtcars_tibble
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows
decomposed_table <- decompose_table(mtcars_tibble, am_gear_carb_id, am, gear, carb)
decomposed_table
#> $child_table
#> # A tibble: 32 x 9
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs am_gear_carb_id
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>           <int>
#>  1  21       6  160    110  3.9   2.62  16.5     0               9
#>  2  21       6  160    110  3.9   2.88  17.0     0               9
#>  3  22.8     4  108     93  3.85  2.32  18.6     1               7
#>  4  21.4     6  258    110  3.08  3.22  19.4     1               1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0               2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1               1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0               4
#>  8  24.4     4  147.    62  3.69  3.19  20       1               5
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1               5
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1               6
#> # … with 22 more rows
#> 
#> $parent_table
#> # A tibble: 13 x 4
#>    am_gear_carb_id    am  gear  carb
#>              <int> <dbl> <dbl> <dbl>
#>  1               1     0     3     1
#>  2               2     0     3     2
#>  3               3     0     3     3
#>  4               4     0     3     4
#>  5               5     0     4     2
#>  6               6     0     4     4
#>  7               7     1     4     1
#>  8               8     1     4     2
#>  9               9     1     4     4
#> 10              10     1     5     2
#> 11              11     1     5     4
#> 12              12     1     5     6
#> 13              13     1     5     8

A new column is created, with which the two tables can be joined again, essentially creating the original table.

The functions that do the inverse operation, i.e. join a parent and a child table and subsequently drop the new_id_column, are reunite_parent_child() and reunite_parent_child_from_list(). The former takes as arguments two tables and the unquoted name of the ID column, and the latter takes as arguments a list of two tables plus the unquoted name of the ID column:

parent_table <- decomposed_table$parent_table
child_table <- decomposed_table$child_table
reunite_parent_child(child_table, parent_table, id_column = am_gear_carb_id)
#> # A tibble: 32 x 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows
# Shortcut:
reunite_parent_child_from_list(decomposed_table, id_column = am_gear_carb_id)

Currently these functions only exist as a low-level operation on tables. We plan to extend this operation to dm objects in the future.