Sourcing structure from database

Krystian Igras

2023-02-10

When pulling schema structure directly from database, you may decide which schema information should be saved in the configuration yaml file. The proper configuration defined with set_faker_opts should be passed to faker_opts parameters of schema_source function:

schema <- source_schema(
  source = conn,
  schema = "public",
  faker_opts = set_faker_opts(...)
)

DataFakeR currently offers two configuration types:

Column specific configuration

The current version of DataFakeR package supports five types (R target types) of columns:

Each column-type configuration is done by setting:

set_faker_opts(opt_pull_<type> = opt_pull_<type>(...))

The possible configurable parameters are (with supported types):

The information stored by the above parameters may then be used in the simulation methods.

The default parameters can be accessed respectively from default_faker_opts object, for example:

character columns:

default_faker_opts$opt_pull_character
#> $values
#> [1] TRUE
#> 
#> $max_uniq_to_pull
#> [1] 10
#> 
#> $nchar
#> [1] TRUE
#> 
#> $na_ratio
#> [1] TRUE
#> 
#> $levels_ratio
#> [1] TRUE

means, by default we save in the existing column values only when number of its unique values is less than 10. We will be also storing maximum number of character for strings in column.

integer columns:

default_faker_opts$opt_pull_integer
#> $values
#> [1] TRUE
#> 
#> $max_uniq_to_pull
#> [1] 10
#> 
#> $range
#> [1] TRUE
#> 
#> $na_ratio
#> [1] TRUE
#> 
#> $levels_ratio
#> [1] FALSE

means the same for sourcing possible values as for character type, more to that we will source the column values range.

Such configuration for sample book authors table, may result with the below structure:

ID Author Digest
1 Miss Madelyn Crist MD Digest A
2 Merritt Gislason IV Digest A
3 Linton Botsford Digest A
4 Isam Bins-Shanahan Digest A
5 Ora Stark Digest A
6 Priscila Auer Digest A
7 Ms. Addie Grady DDS Digest B
8 Dr. Wayman Halvorson V Digest B
9 Kesha Legros Digest B
10 Gay Hoppe Digest B
11 Yolanda Greenholt Digest B
authors:
  columns:
    ID:
      type: serial
      unique: true
      not_null: true
      default: na.integer
      range: [1, 11]
    author:
      type: varchar
      unique: true
      not_null: true
      default: na.character
      nchar: 23
    digest:
      type: varchar
      unique: false
      not_null: true
      default: na.character
      values: [Digest A, Digest B]
      nchar: 8

If we want to not source range and nchar information just precise:

my_opts <- set_faker_opts(
  opt_pull_integer = opt_pull_integer(range = FALSE),
  opt_pull_character = opt_pull_character(nchar = FALSE)
)

and pass my_opts to faker_opts parameter of schema_source function.

Table specific configuration

Can be achieved by specifying opt_pull_table option with the method of the same name.

In the current version of DataFakeR package only one parameter (nrows) can be configured, with the three values:

Such information can be further used to define number of rows in simulated table (see simulation options).

Note: In the future DataFakeR releases the option to define custom parameters will be enabled.