Class dm and basic operations

2020-07-02

The goal of the {dm} package and the dm class that comes with it, is to make your life easier when you are dealing with data from several different tables.

Let’s take a look at the dm class.

Class dm

The dm class consists of a collection of tables and metadata about the tables, such as

All tables in a dm must be obtained from the same data source; csv files and spreadsheets would need to be imported to data frames in R.

Examples of dm objects

The relevant functions for creating dm objects are:

  1. dm()
  2. as_dm()
  3. new_dm()

There are currently three options available for creating a dm object. To illustrate these options, we will now create the same dm in three different ways. We can use the tables from the well-known {nycflights13} package.

Start with an empty dm

Start with an empty dm object that has been created with dm() or new_dm(), and add tables to that object:

library(nycflights13)
library(dm)
empty_dm <- dm()
empty_dm
#> dm()
dm_add_tbl(empty_dm, airlines, airports, flights, planes, weather) 
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

Coerce a list of tables

Turn a named list of tables into a dm with as_dm():

as_dm(list(airlines = airlines, 
           airports = airports, 
           flights = flights, 
           planes = planes, 
           weather = weather))
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

Turn tables from a src into a dm

Squeeze all (or a subset of) tables belonging to a src object into a dm using dm():

library(dplyr)
flights_dm <- dm_from_src(src_df(pkg = "nycflights13"))
#> Warning: `src_local()` is deprecated as of dplyr 1.0.0.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
flights_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0

Here we are making use of the fact that the function dm(src, table_names = NULL) includes all available tables on a source in the dm object, should the table_names-argument be left NULL. This means that you can use this, for example, on a postgres database that you access via src_postgres() (with the appropriate arguments dbname, host, port, …), to produce a dm object with all the tables on the database.

Another way of creating a dm object is calling new_dm() on a list of tbls, e.g., the tables included in the well-known iris dataset:

iris_dm <- new_dm(list("iris1" = iris, "iris2" = iris))
iris_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `iris1`, `iris2`
#> Columns: 10
#> Primary keys: 0
#> Foreign keys: 0

We can get the list of tables with dm_get_tables() and the src object with dm_get_src().

In order to pull a specific table from a dm, use:

tbl(flights_dm, "airports")
#> # A tibble: 1,458 x 8
#>    faa   name                    lat    lon   alt    tz dst   tzone        
#>    <chr> <chr>                 <dbl>  <dbl> <dbl> <dbl> <chr> <chr>        
#>  1 04G   Lansdowne Airport      41.1  -80.6  1044    -5 A     America/New_…
#>  2 06A   Moton Field Municipa…  32.5  -85.7   264    -6 A     America/Chic…
#>  3 06C   Schaumburg Regional    42.0  -88.1   801    -6 A     America/Chic…
#>  4 06N   Randall Airport        41.4  -74.4   523    -5 A     America/New_…
#>  5 09J   Jekyll Island Airport  31.1  -81.4    11    -5 A     America/New_…
#>  6 0A9   Elizabethton Municip…  36.4  -82.2  1593    -5 A     America/New_…
#>  7 0G6   Williams County Airp…  41.5  -84.5   730    -5 A     America/New_…
#>  8 0G7   Finger Lakes Regiona…  42.9  -76.8   492    -5 A     America/New_…
#>  9 0P2   Shoestring Aviation …  39.8  -76.6  1000    -5 U     America/New_…
#> 10 0S9   Jefferson County Intl  48.1 -123.    108    -8 A     America/Los_…
#> # … with 1,448 more rows

But how can we use {dm}-functions to manage the primary keys of the tables in a dm object?

Primary keys of dm objects

Some useful functions for managing primary key settings are:

  1. dm_add_pk()
  2. dm_has_pk()
  3. dm_get_pk()
  4. dm_rm_pk()
  5. dm_enum_pk_candidates()
  6. dm_get_all_pks()

Currently dm objects only support one-column primary keys. If your tables have unique compound keys, adding a surrogate key column might be helpful. If you created a dm object according to the examples in “Examples of dm objects”, your object does not yet have any primary keys set. So let’s add one.

dm_add_pk() has an option to check if the column of the table given by the user is a unique key; for performance reasons, the check will not be executed unless requested. Since the iris dataset does not have any unique one-column keys, we will use the nycflights13 tables, i.e. flights_dm from above.

dm_has_pk(flights_dm, airports)
#> [1] FALSE
flights_dm_with_key <- dm_add_pk(flights_dm, airports, faa)
flights_dm_with_key
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 1
#> Foreign keys: 0

The dm now has a primary key. Let’s check:

dm_has_pk(flights_dm_with_key, airports)
#> [1] TRUE

Get the name of the column that is marked as primary key of the table:

dm_get_pk(flights_dm_with_key, airports)
#> <list_of<character>[1]>
#> [[1]]
#> [1] "faa"

Remove a primary key:

dm_rm_pk(flights_dm_with_key, airports) %>% 
  dm_has_pk(airports)
#> [1] FALSE

If you still need to get to know your data better, and it is already available in the form of a dm object, you can use the dm_enum_pk_candidates() function in order to get information about which columns of the table are unique keys:

dm_enum_pk_candidates(flights_dm_with_key, airports)
#> # A tibble: 8 x 3
#>   columns candidate why                                                    
#>   <keys>  <lgl>     <chr>                                                  
#> 1 faa     TRUE      ""                                                     
#> 2 lon     TRUE      ""                                                     
#> 3 alt     FALSE     "has duplicate values: 0, 1, 3, 4, 5, …"               
#> 4 dst     FALSE     "has duplicate values: A, N, U"                        
#> 5 lat     FALSE     "has duplicate values: 38.88944, 40.63975"             
#> 6 name    FALSE     "has duplicate values: All Airports, Capital City Airp…
#> 7 tz      FALSE     "has duplicate values: -10, -9, -8, -7, -6, …"         
#> 8 tzone   FALSE     "has duplicate values: America/Anchorage, America/Chic…

The flights table does not have any one-column primary key candidates:

dm_enum_pk_candidates(flights_dm_with_key, flights) %>% count(candidate)
#> # A tibble: 1 x 2
#>   candidate     n
#>   <lgl>     <int>
#> 1 FALSE        19

To get an overview over all tables with primary keys, use dm_get_all_pks():

dm_get_all_pks(dm_nycflights13(cycle = TRUE))
#> # A tibble: 3 x 2
#>   table    pk_col 
#>   <chr>    <keys> 
#> 1 airlines carrier
#> 2 airports faa    
#> 3 planes   tailnum

Here we used the prepared dm object dm_nycflights13(cycle = TRUE) as an example. This object already has all keys pre-set.

Foreign keys

Useful functions for managing foreign key relations include:

  1. dm_add_fk()
  2. dm_has_fk()
  3. dm_get_fk()
  4. dm_rm_fk()
  5. dm_enum_fk_candidates()
  6. dm_get_all_fks()

Now it gets (even more) interesting: we want to define relations between different tables. With the dm_add_fk() function you can define which column of which table points to another table’s column.

This is done by choosing a foreign key from one table that will point to a primary key of another table. The primary key of the referred table must be set with dm_add_pk(). dm_add_fk() will find the primary key column of the referenced table by itself and make the indicated column of the child table point to it.

flights_dm_with_key %>% dm_add_fk(flights, origin, airports)
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  <environment: R_GlobalEnv>
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 1
#> Foreign keys: 1

This will throw an error:

flights_dm %>% dm_add_fk(flights, origin, airports)
#> Error: ref_table `airports` needs a primary key first. Use `dm_enum_pk_candidates()` to find appropriate columns and `dm_add_pk()` to define a primary key.

Let’s create a dm object with a foreign key relation to work with later on:

flights_dm_with_fk <- dm_add_fk(flights_dm_with_key, flights, origin, airports)

What if we tried to add another foreign key relation from flights to airports to the object? Column dest might work, since it also contains airport codes:

flights_dm_with_fk %>% dm_add_fk(flights, dest, airports, check = TRUE)
#> Error: Column `dest` of table `flights` contains values (see above) that are not present in column `faa` of table `airports`.

As you can see, behind the scenes, checks are executed automatically (unless check = FALSE) by the functions of dm to prevent steps that would result in inconsistent representations.

Use dm_has_fk() for checking if a foreign key exists that is pointing from one table to another:

flights_dm_with_fk %>% dm_has_fk(flights, planes)
#> [1] FALSE
flights_dm_with_fk %>% dm_has_fk(flights, airports)
#> [1] TRUE

If you want to access the name of the column which acts as a foreign key of one table to another table’s column, use dm_get_fk():

flights_dm_with_fk %>% dm_get_fk(flights, planes)
#> <list_of<character>[0]>
flights_dm_with_fk %>% dm_get_fk(flights, airports)
#> <list_of<character>[1]>
#> [[1]]
#> [1] "origin"

Remove foreign key relations with dm_rm_fk() (parameter column = NULL means that all relations will be removed):

flights_dm_with_fk %>% 
  dm_rm_fk(table = flights, column = dest, ref_table = airports) %>% 
  dm_get_fk(flights, airports)
#> Error: (`dest`) is not a foreign key of table `flights` into table `airports`.
flights_dm_with_fk %>% 
  dm_rm_fk(flights, origin, airports) %>% 
  dm_get_fk(flights, airports)
#> <list_of<character>[0]>
flights_dm_with_fk %>% 
  dm_rm_fk(flights, NULL, airports) %>% 
  dm_get_fk(flights, airports)
#> <list_of<character>[0]>

Since the primary keys are defined in the dm object, you do not need to provide the referenced column name of ref_table. This is always the primary key column of the table.

Another function for getting to know your data better (cf. dm_enum_pk_candidates() in “Primary keys of dm objects”) is dm_enum_fk_candidates(). Use it to get an overview over foreign key candidates that point from one table to another:

dm_enum_fk_candidates(flights_dm_with_key, weather, airports)
#> # A tibble: 15 x 3
#>    columns    candidate why                                                
#>    <keys>     <lgl>     <chr>                                              
#>  1 origin     TRUE      ""                                                 
#>  2 day        FALSE     "Can't join on `x$value` x `y$value` because of in…
#>  3 dewp       FALSE     "Can't join on `x$value` x `y$value` because of in…
#>  4 hour       FALSE     "Can't join on `x$value` x `y$value` because of in…
#>  5 humid      FALSE     "Can't join on `x$value` x `y$value` because of in…
#>  6 month      FALSE     "Can't join on `x$value` x `y$value` because of in…
#>  7 precip     FALSE     "Can't join on `x$value` x `y$value` because of in…
#>  8 pressure   FALSE     "Can't join on `x$value` x `y$value` because of in…
#>  9 temp       FALSE     "Can't join on `x$value` x `y$value` because of in…
#> 10 time_hour  FALSE     "Can't join on `x$value` x `y$value` because of in…
#> 11 visib      FALSE     "Can't join on `x$value` x `y$value` because of in…
#> 12 wind_dir   FALSE     "Can't join on `x$value` x `y$value` because of in…
#> 13 wind_gust  FALSE     "Can't join on `x$value` x `y$value` because of in…
#> 14 wind_speed FALSE     "Can't join on `x$value` x `y$value` because of in…
#> 15 year       FALSE     "Can't join on `x$value` x `y$value` because of in…

Get an overview of all foreign key relations withdm_get_all_fks():

dm_get_all_fks(dm_nycflights13(cycle = TRUE))
#> # A tibble: 4 x 3
#>   child_table child_fk_cols parent_table
#>   <chr>       <keys>        <chr>       
#> 1 flights     carrier       airlines    
#> 2 flights     dest          airports    
#> 3 flights     origin        airports    
#> 4 flights     tailnum       planes