Introduction

Across a wide variety of statistical techniques and machine learning algorithms, R’s formula object provides a standardized process for specifying the outcomes and inputs to be utilized when a method is applied to a data set. In typical examples, e.g. R’s help file for formula objects, a model is specified in a manual way with a formula such as y ~ a + b + c. For parsimonious models specified by a programmer, a manual selection and entry can be sufficient. However, a variety of applications can present more challenging circumstances in which manual specification may not be an effective strategy. Dynamically generated models may be specified by the user of a graphical interface (e.g. with R’s shiny package). In this case, a programmatic means of specifying a formula based on the user’s selections would be necessary. Even in manual settings, formula objects would benefit from additional quality checks that ensure that the model’s specification is appropriate for the data provided.

The formulaic package has two main functions, create.formula and reduce.existing.formula, and one subsidiary function, add.backtick. The main purpose of the package is to help users build robust models faster and more conveniently.

create.formula automatically creates a formula from a provided list of input variables and an outcome variable. The variables undergo a series of quality checks, such as reducing variables with too many categories and eliminating typos, duplicates, and features that lack contrast, to make sure that each feature is usable for modeling. This reduces the time needed to build a model and frees users from the tedious task of manually entering variables. The resulting formula can be used with a wide range of methods, from simple linear regression to more complex machine learning techniques such as random forests and neural networks.

The principal advantages of using create.formula are as follows:

Being able to dynamically generate a formula from a vector of inputs, without necessarily having to spell them all out by name.
Adding variables by searching for patterns.
Simple integration of interactions.
Easy removal of specific variables.
Quality checks that resolve a variety of issues – typos, duplication, lack of contrast, etc. – while providing a transparent explanation.

reduce.existing.formula trims down an existing formula. Users pass an existing formula to the function, and it then undergoes the same quality checks as in create.formula.

add.backtick applies backticks to variable names. By default, it adds them only to the names that need backticks in order to be used in a formula. Users can also add backticks to all of the variables; however, this is not necessary.

formulaic is useful for creating dynamic formulas with multiple features. It not only reduces the time required for modeling and implementation, but also improves the quality of the result.

awareness.name = "Awareness"
variable.names = c("Age", "Gender", "Income Group", "Region", "Persona", "Typo")

ex.form <-
  create.formula(outcome.name = awareness.name,
                 input.names = variable.names,
                 dat = snack.dat)

ex.form$formula
#> Awareness ~ Age + Gender + `Income Group` + Region + Persona
#> <environment: 0x00000000143e8148>
lm_example <- lm(formula = ex.form, data = snack.dat)
summary(lm_example)
#> 
#> Call:
#> lm(formula = ex.form, data = snack.dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -0.5853 -0.5245  0.4440  0.4743  0.5196 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                    0.5593478  0.0149319  37.460  < 2e-16 ***
#> Age                           -0.0001466  0.0001554  -0.943 0.345786    
#> GenderMale                    -0.0217796  0.0066156  -3.292 0.000996 ***
#> `Income Group`[ 50000, 75000)  0.0090771  0.0100586   0.902 0.366845    
#> `Income Group`[ 75000,100000)  0.0057200  0.0102907   0.556 0.578322    
#> `Income Group`[100000,150000)  0.0066225  0.0084932   0.780 0.435549    
#> `Income Group`[150000,250000]  0.0377699  0.0376171   1.004 0.315359    
#> RegionNortheast               -0.0009939  0.0097772  -0.102 0.919034    
#> RegionSouth                   -0.0096367  0.0117113  -0.823 0.410600    
#> RegionWest                    -0.0327901  0.0094162  -3.482 0.000498 ***
#> PersonaMainstream Maynard     -0.0030736  0.0105171  -0.292 0.770099    
#> PersonaMillenial Muncher      -0.0108831  0.0112126  -0.971 0.331752    
#> PersonaOld School Oliver      -0.0078579  0.0132893  -0.591 0.554332    
#> PersonaRighteous Reviewer     -0.0057425  0.0121762  -0.472 0.637203    
#> PersonaSavvy Samantha         -0.0111625  0.0110298  -1.012 0.311536    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4991 on 22985 degrees of freedom
#> Multiple R-squared:  0.001567,   Adjusted R-squared:  0.0009586 
#> F-statistic: 2.576 on 14 and 22985 DF,  p-value: 0.001026

Dynamic Generation of a Formula

A formula object may be one component of a larger system of software that processes data, generates models, and reports information. Dynamic applications with user interfaces, such as those generated with the shiny package, can allow a user to specify many of the parameters. This may include the type of model to fit, the outcome and input variables, and filters on the subset of the data to incorporate.

In this application, the user is provided with a wide array of choices. A variety of outcomes related to customer engagement may be modeled. The user can select a subset of data related to a specific brand or aggregate multiple brands together. The user may also choose from a menu of inputs spanning all relevant columns of the data set. Then these data can be filtered into specific subgroups based on selections across a number of variables, including age groups, gender, income groups, region, etc.

Because the user’s selections are dynamic, the modeling formula must be generated programmatically. The create.formula function includes parameters for the outcome.name – a character vector of length 1 – and the input.names – a character vector of any length. As an example, if the user provides specific selections, then create.formula will automatically generate the corresponding formula object:

user.outcome.name <- "Satisfaction"
user.input.names <- c('Age Group', 'Gender', 'Region')

create.formula(outcome.name = user.outcome.name, input.names = user.input.names)$formula
#> Satisfaction ~ `Age Group` + Gender + Region
#> <environment: 0x0000000014e267e0>
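
For comparison, the same kind of programmatic construction can be sketched in base R alone. This is a minimal illustration under stated assumptions, not the package's implementation: the helper build.formula is hypothetical, and it simply backticks every name and relies on R to drop the unnecessary backticks when the text is parsed into a formula.

```r
# Hypothetical helper: build a formula from character selections.
build.formula <- function(outcome.name, input.names) {
  # Backtick every name; parsing keeps backticks only where they are required.
  quoted <- sprintf("`%s`", c(outcome.name, input.names))
  rhs <- paste(quoted[-1], collapse = " + ")
  as.formula(paste(quoted[1], "~", rhs))
}

# The same selections as in the example above.
user.formula <- build.formula(
  outcome.name = "Satisfaction",
  input.names = c("Age Group", "Gender", "Region")
)

print(user.formula)

# The variables referenced by the formula, as plain names.
all.vars(user.formula)
```

Unlike create.formula, this sketch performs no quality checks: a misspelled or duplicated name would pass straight through to the modeling function.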

Dataset (snack.dat)

To illustrate the basic functions of the formulaic package, we generated a dataset named snack.dat.

Formatted as a data.table object, snack.dat contains 23000 observations and 25 columns. These data contain simulated information from a fictionalized marketing survey. In this survey, a progression of questions was asked about the respondents’ awareness, consideration, consumption, satisfaction with, and advocacy for different brands of snack foods. Questions downstream of awareness, consideration, and then consumption were asked only of those respondents who responded affirmatively to the previous question. Otherwise, the values are missing. Brand Perception questions are rated on a scale from 0 to 10 and indicated with a name starting with the prefix BP.

list(dim(snack.dat), names(snack.dat))
#> [[1]]
#> [1] 23000    25
#> 
#> [[2]]
#>  [1] "User ID"                   "Age"                      
#>  [3] "Gender"                    "Income"                   
#>  [5] "Region"                    "Persona"                  
#>  [7] "Product"                   "Awareness"                
#>  [9] "BP_For_Me_0_10"            "BP_Fits_Budget_0_10"      
#> [11] "BP_Tastes_Great_0_10"      "BP_Good_To_Share_0_10"    
#> [13] "BP_Like_Logo_0_10"         "BP_Special_Occasions_0_10"
#> [15] "BP_Everyday_Snack_0_10"    "BP_Healthy_0_10"          
#> [17] "BP_Delicious_0_10"         "BP_Right_Amount_0_10"     
#> [19] "BP_Relaxing_0_10"          "Consideration"            
#> [21] "Consumption"               "Satisfaction"             
#> [23] "Advocacy"                  "Age Group"                
#> [25] "Income Group"

Adding Backticks (add.backtick)

As a subsidiary function, add.backtick is used inside create.formula to add backticks to the names of variables. Formula objects include the names of different variables within a data.frame. When these names contain a space, the name must be enclosed in backticks to ensure proper formatting. For instance, if there are three variables called y, x1, and User ID, then a formula written as y ~ x1 + User ID will generate errors due to the space in User ID. Instead, this formula can be properly written as y ~ x1 + `User ID`. It is also acceptable to add backticks to the other names, such as `y` ~ `x1` + `User ID`, but this is not a necessary step. By default, include.backtick is set to 'as.needed', which indicates that the function will only add backticks to the variables that require them. The user may change the option to 'all'. However, this option only takes effect when format.as != "formula", in which case a character object is returned, because a formula object automatically removes unnecessary backticks.

NOTE: In the snack.dat data, User ID, Age Group, and Income Group are the only variables affected by the function when include.backtick is set to 'as.needed', while every variable receives backticks when it is set to 'all'.

Update Notes: The add.backtick function now accepts the data.frame dat as an additional, optional input. When provided, the judgment about when to use backticks is based on whether the input exists as a variable name of dat and requires backticks. Other cases (like transformations) will not include them. Please refer to Example 5.

as.needed = formulaic::add.backtick(x = names(snack.dat), include.backtick = 'as.needed')
all = formulaic::add.backtick(x = names(snack.dat), include.backtick = 'all')


data = cbind(as.needed, all)
list(data)
#> [[1]]
#>       as.needed                   all                          
#>  [1,] "`User ID`"                 "`User ID`"                  
#>  [2,] "Age"                       "`Age`"                      
#>  [3,] "Gender"                    "`Gender`"                   
#>  [4,] "Income"                    "`Income`"                   
#>  [5,] "Region"                    "`Region`"                   
#>  [6,] "Persona"                   "`Persona`"                  
#>  [7,] "Product"                   "`Product`"                  
#>  [8,] "Awareness"                 "`Awareness`"                
#>  [9,] "BP_For_Me_0_10"            "`BP_For_Me_0_10`"           
#> [10,] "BP_Fits_Budget_0_10"       "`BP_Fits_Budget_0_10`"      
#> [11,] "BP_Tastes_Great_0_10"      "`BP_Tastes_Great_0_10`"     
#> [12,] "BP_Good_To_Share_0_10"     "`BP_Good_To_Share_0_10`"    
#> [13,] "BP_Like_Logo_0_10"         "`BP_Like_Logo_0_10`"        
#> [14,] "BP_Special_Occasions_0_10" "`BP_Special_Occasions_0_10`"
#> [15,] "BP_Everyday_Snack_0_10"    "`BP_Everyday_Snack_0_10`"   
#> [16,] "BP_Healthy_0_10"           "`BP_Healthy_0_10`"          
#> [17,] "BP_Delicious_0_10"         "`BP_Delicious_0_10`"        
#> [18,] "BP_Right_Amount_0_10"      "`BP_Right_Amount_0_10`"     
#> [19,] "BP_Relaxing_0_10"          "`BP_Relaxing_0_10`"         
#> [20,] "Consideration"             "`Consideration`"            
#> [21,] "Consumption"               "`Consumption`"              
#> [22,] "Satisfaction"              "`Satisfaction`"             
#> [23,] "Advocacy"                  "`Advocacy`"                 
#> [24,] "`Age Group`"               "`Age Group`"                
#> [25,] "`Income Group`"            "`Income Group`"

This feature is automatically incorporated into formulaic’s create.formula method:

create.formula(outcome.name = awareness.name, input.names = variable.names)$formula
#> Awareness ~ Age + Gender + `Income Group` + Region + Persona + 
#>     Typo
#> <environment: 0x000000001f1284d8>

create.formula(
  outcome.name = awareness.name,
  input.names = variable.names,
  format.as = "character",
  include.backtick = "all"
)$formula
#> [1] "`Awareness` ~ `Age` + `Gender` + `Income Group` + `Region` + `Persona` + `Typo`"

When the output is returned as a formula object, the backticks may only be provided on an as-needed basis. For character objects, either option may be selected.

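# Variable names used below (values inferred from the output shown).
age.name <- "Age"
gender.name <- "Gender"
region.name <- "Region"
income.group.name <- "Income Group"
create.formula(
  outcome.name = awareness.name,
  input.names = c(region.name, gender.name, sprintf("sqrt(%s^2)", age.name), income.group.name, "ldkao"),
  format.as = "character",
  include.backtick = "as.needed"
)$formula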
#> [1] "Awareness ~ Region + Gender + `sqrt(Age^2)` + `Income Group` + ldkao"

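A transformation like sqrt(Age) should specifically not be placed inside of backticks, because R would then look for a single variable literally named `sqrt(Age)` rather than evaluating the expression. Because no dat is supplied in the call above, the transformation cannot be distinguished from a variable name and receives backticks. Providing dat allows the function to add backticks only to actual variable names.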
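
The distinction between variable names and transformations can be sketched in base R. This is only an illustration of the idea described in the update note above, not the source code of add.backtick: the helper backtick.as.needed and the toy data are hypothetical, and the sketch assumes that a name needs backticks whenever it differs from its make.names() version.

```r
# Hypothetical helper: add backticks only where they are required.
backtick.as.needed <- function(x, dat = NULL) {
  # A name is non-syntactic if make.names() would have to alter it.
  needs.backtick <- x != make.names(x)
  if (!is.null(dat)) {
    # With data available, only quote entries that are actual column names.
    needs.backtick <- needs.backtick & (x %in% names(dat))
  }
  ifelse(needs.backtick, sprintf("`%s`", x), x)
}

# Toy data with one syntactic and one non-syntactic column name.
toy.dat <- data.frame(Age = c(25, 40))
toy.dat[["Income Group"]] <- c("low", "high")

term.labels <- c("Age", "Income Group", "sqrt(Age^2)")

# Without data, the transformation is (incorrectly) treated as a name.
without.dat <- backtick.as.needed(term.labels)

# With data, only the column name containing a space is quoted.
with.dat <- backtick.as.needed(term.labels, dat = toy.dat)

print(without.dat)
print(with.dat)
```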

Creating a Formula (create.formula)

The create.formula function is designed to automatically generate formulas from user-specified inputs and an outcome. The range of inputs may include directly specified variables, patterns to search within the names of an associated data.frame, a list of interactions, and a vector of variables to directly exclude from consideration. The method also provides a range of quality checks that can detect issues with the construction of a formula and, at the user’s discretion, automatically remove variables that would otherwise generate errors. These quality checks include formatting variables with backticks, de-duplication, ensuring correspondence with the names of the variables in an associated data.frame, excluding categorical variables that would generate errors due to a lack of contrast or that exceed a user-specified threshold for the maximum number of categories, and automatically removing interactions involving variables that should be excluded. When directed by the user, these quality checks can be implemented to effectively reduce a formula to the subset of variables and interactions that would be appropriate for consideration in a statistical model. The output of the function can be formatted as either a formula object or a character.

Update Notes: the create.formula function now evaluates the inputs and outcomes to see if real results are generated rather than looking for the names of existing variables. This means that any transformation – e.g. sqrt(Age^2 * log(Income)) – is potentially available for inclusion. Likewise, the outcomes can also be transformations like log(Income).

Parameter description:

outcome.name A character value specifying the name of the formula’s outcome variable. In this version, only a single outcome may be included. The first entry of outcome.name will be used to build the formula.
input.names A character vector containing the full names of the input variables.
input.patterns Includes additional input variables. The user may enter patterns – e.g. to include every variable with a name that includes the pattern. Multiple patterns may be included as a character vector. However, each pattern may not contain spaces and is otherwise subject to the same limits on patterns as used in the grep function.
dat A data.frame object. When provided, it is used to remove any variables that are not listed in names(dat). The default is NULL, in which case the formula is created simply from outcome.name and input.names.
interactions A list of character vectors. Each character vector includes the names of the variables that form a single interaction. Specifying interactions = list(c("x", "y"), c("x", "z"), c("y", "z"), c("x", "y", "z")) would lead to the interactions x*y + x*z + y*z + x*y*z.
force.main.effects A logical value. When TRUE, the intent is that any term included as an interaction (of multiple variables) must also be listed individually as a main effect.
reduce A logical value. When dat is not NULL and reduce is TRUE, additional quality checks are performed to examine the input variables. Any input variables that exhibit a lack of contrast will be excluded from the model. This search is global by default but may be conducted separately in subsets of the outcome variables by specifying max.outcome.categories.to.search. Additionally, any input variables that exhibit too many contrasts, as defined by max.input.categories, will also be excluded.
max.input.categories A numeric value that limits the maximum number of categories a categorical input variable may have. When reduce = TRUE, input variables that exceed this threshold are excluded from the formula. The default is 20, but users can change it as needed.
max.outcome.categories.to.search A numeric value. The create.formula function includes a feature that identifies input variables exhibiting a lack of contrast. When reduce = TRUE, these variables are automatically excluded from the resulting formula. This search may be expanded to subsets of the outcome when the number of unique measured values of the outcome is no greater than max.outcome.categories.to.search. In this case, each subset of the outcome will be separately examined, and any inputs that exhibit a lack of contrast within at least one subset will be excluded.
order.as The user can specify the order of the input variables in the formula in a variety of ways for patterns: increasing for increasing alphabetical order, decreasing for decreasing alphabetical order, column.order for the order in which they appear in the data, and as.specified for maintaining the user’s specified order.
include.backtick Specifies when backticks are added. The default is 'as.needed', which adds backticks only when they are needed. The other option is 'all'. The use of include.backtick = "all" is limited to cases in which the output is generated as a character variable. When the output is generated as a formula object, R automatically removes all unnecessary backticks. That is, 'all' is only compatible with format.as != "formula".
format.as The data type of the output. If not set as “formula”, then a character vector will be returned.
variables.to.exclude A character vector. Any variable specified in variables.to.exclude will be dropped from the formula, both in the individual inputs and in any associated interactions. This step supersedes the inclusion of any variables specified for inclusion in the other parameters.
include.intercept A logical value. When FALSE, the intercept will be removed from the formula.
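
The checks controlled by reduce and max.input.categories can be approximated in base R. The following is a minimal sketch under stated assumptions, not the package's implementation: the helper check.inputs and the toy data are hypothetical, and the exact rules used here (fewer than two observed values counts as a lack of contrast, strictly more than max.input.categories counts as too many categories) are assumptions made for illustration.

```r
# Toy data (hypothetical): "Constant" has a single level, "ID" has many.
toy.dat <- data.frame(
  y = c(1, 0, 1, 0, 1, 0),
  Gender = c("F", "M", "F", "M", "F", "M"),
  Constant = rep("a", 6),
  ID = paste0("u", 1:6),
  stringsAsFactors = TRUE
)

# Hypothetical helper: flag inputs that would be unsuitable for modeling.
check.inputs <- function(dat, input.names, max.input.categories = 20) {
  # Number of distinct non-missing values per input.
  n.unique <- vapply(
    input.names,
    function(v) length(unique(dat[[v]][!is.na(dat[[v]])])),
    integer(1)
  )
  # Only factor or character inputs are treated as categorical here.
  is.categorical <- vapply(
    input.names,
    function(v) is.factor(dat[[v]]) || is.character(dat[[v]]),
    logical(1)
  )
  data.frame(
    variable = input.names,
    lacks.contrast = unname(n.unique < 2),
    too.many.categories = unname(is.categorical & n.unique > max.input.categories),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

checks <- check.inputs(toy.dat, c("Gender", "Constant", "ID"),
                       max.input.categories = 5)
print(checks)

# Inputs that pass both checks.
kept <- checks$variable[!checks$lacks.contrast & !checks$too.many.categories]
print(kept)
```

In this toy example only Gender survives: Constant has a single level, and ID has six categories against a threshold of five.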

Basic format

outcome.name.awareness <- "Awareness"
input.names <-
  c("Age", "Gender", "Income", "Region", "Persona", "Typo")

basic.form <-
  create.formula(outcome.name = outcome.name.awareness,
                 input.names = input.names,
                 dat = snack.dat)

print(basic.form)
#> $formula
#> Awareness ~ Age + Gender + Income + Region + Persona
#> <environment: 0x000000001f90eaa0>
#> 
#> $inclusion.table
#>    variable exclude.null.quantity   class order specified.from
#> 1:      Age                 FALSE integer     1    input.names
#> 2:   Gender                 FALSE  factor     2    input.names
#> 3:   Income                 FALSE numeric     3    input.names
#> 4:   Region                 FALSE  factor     4    input.names
#> 5:  Persona                 FALSE  factor     5    input.names
#> 6:     Typo                  TRUE    <NA>     6    input.names
#>    exclude.user.specified exclude.matches.outcome.name include.variable
#> 1:                  FALSE                        FALSE             TRUE
#> 2:                  FALSE                        FALSE             TRUE
#> 3:                  FALSE                        FALSE             TRUE
#> 4:                  FALSE                        FALSE             TRUE
#> 5:                  FALSE                        FALSE             TRUE
#> 6:                     NA                        FALSE            FALSE
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

Creating Interactions of Variables

The function allows users to incorporate interaction terms easily with the interactions parameter. Each interaction is specified as a character vector, and the entire range of interactions is entered as a list, which allows different interactions to include different numbers of variables:

interactions = list(c("Age Group", "Gender"),
                    c("Age Group", "Region"),
                    c("Age Group", "Gender", "Region"))

interaction.form <-
  create.formula(
    outcome.name = outcome.name.awareness,
    input.names = input.names,
    dat = snack.dat,
    interactions = interactions
  )

print(interaction.form)
#> $formula
#> Awareness ~ Age + Gender + Income + Region + Persona + `Age Group` + 
#>     `Age Group` * Gender + `Age Group` * Region + `Age Group` * 
#>     Gender * Region
#> <environment: 0x0000000020ad7fc0>
#> 
#> $inclusion.table
#>     variable exclude.null.quantity   class order specified.from
#> 1:       Age                 FALSE integer     1    input.names
#> 2:    Gender                 FALSE  factor     2    input.names
#> 3:    Income                 FALSE numeric     3    input.names
#> 4:    Region                 FALSE  factor     4    input.names
#> 5:   Persona                 FALSE  factor     5    input.names
#> 6:      Typo                  TRUE    <NA>     6    input.names
#> 7: Age Group                 FALSE  factor     7   interactions
#>    exclude.user.specified exclude.matches.outcome.name include.variable
#> 1:                  FALSE                        FALSE             TRUE
#> 2:                  FALSE                        FALSE             TRUE
#> 3:                  FALSE                        FALSE             TRUE
#> 4:                  FALSE                        FALSE             TRUE
#> 5:                  FALSE                        FALSE             TRUE
#> 6:                     NA                        FALSE            FALSE
#> 7:                  FALSE                        FALSE             TRUE
#> 
#> $interactions.table
#>                     interactions include.interaction
#> 1:          `Age Group` * Gender                TRUE
#> 2:          `Age Group` * Region                TRUE
#> 3: `Age Group` * Gender * Region                TRUE
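
To make the mapping from the interactions list to formula terms concrete, the following base R sketch converts each character vector into a single interaction term. It is an illustration under stated assumptions rather than the package's code: the helper quote.name is hypothetical, and the outcome y is a placeholder used only so that the terms can be parsed.

```r
interactions <- list(c("Age Group", "Gender"),
                     c("Age Group", "Region"),
                     c("Age Group", "Gender", "Region"))

# Hypothetical helper: backtick names that are not syntactically valid.
quote.name <- function(x) ifelse(x != make.names(x), sprintf("`%s`", x), x)

# One term per list element, joining the variables with "*".
interaction.terms <- vapply(
  interactions,
  function(vars) paste(quote.name(vars), collapse = " * "),
  character(1)
)
print(interaction.terms)

# The distinct variables involved, i.e. the implied main effects.
main.effects <- unique(unlist(interactions))
print(main.effects)

# R expands a * b into a + b + a:b, so the terms overlap once parsed.
f <- as.formula(paste("y ~", paste(interaction.terms, collapse = " + ")))
expanded <- attr(terms(f), "term.labels")
print(expanded)
```

Because the * operator already implies the main effects and lower-order interactions, the three terms above expand to seven distinct model terms.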

Selecting Variables from Patterns

Large data sets may include classes of variables that are identified with a common pattern within their names. Rather than including each variable individually, it can be helpful to programmatically identify all of the variables that correspond to a specific pattern. For instance, the snack.dat dataset contains a set of brand perception variables that share the prefix BP_.

When a set of patterns is specified with the input.patterns parameter, the create.formula function identifies any variable that includes at least one of these patterns for inclusion in the formula. In order to do so, the user must also specify the data to be searched. Consider the example below:

bp.pattern = "BP_"
input.patterns = c("Gend", bp.pattern)

pattern.form <-
  create.formula(
    outcome.name = outcome.name.awareness,
    input.names = input.names,
    dat = snack.dat,
    input.patterns = input.patterns
  )

print(pattern.form)
#> $formula
#> Awareness ~ Age + Gender + Income + Region + Persona + BP_For_Me_0_10 + 
#>     BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + BP_Good_To_Share_0_10 + 
#>     BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + BP_Everyday_Snack_0_10 + 
#>     BP_Healthy_0_10 + BP_Delicious_0_10 + BP_Right_Amount_0_10 + 
#>     BP_Relaxing_0_10
#> <environment: 0x000000001eecf398>
#> 
#> $inclusion.table
#>                      variable exclude.null.quantity   class order
#>  1:                       Age                 FALSE integer     1
#>  2:                    Gender                 FALSE  factor     2
#>  3:                    Income                 FALSE numeric     3
#>  4:                    Region                 FALSE  factor     4
#>  5:                   Persona                 FALSE  factor     5
#>  6:                      Typo                  TRUE    <NA>     6
#>  7:            BP_For_Me_0_10                 FALSE integer     7
#>  8:       BP_Fits_Budget_0_10                 FALSE integer     8
#>  9:      BP_Tastes_Great_0_10                 FALSE integer     9
#> 10:     BP_Good_To_Share_0_10                 FALSE integer    10
#> 11:         BP_Like_Logo_0_10                 FALSE integer    11
#> 12: BP_Special_Occasions_0_10                 FALSE integer    12
#> 13:    BP_Everyday_Snack_0_10                 FALSE integer    13
#> 14:           BP_Healthy_0_10                 FALSE integer    14
#> 15:         BP_Delicious_0_10                 FALSE integer    15
#> 16:      BP_Right_Amount_0_10                 FALSE integer    16
#> 17:          BP_Relaxing_0_10                 FALSE integer    17
#>     specified.from exclude.user.specified exclude.matches.outcome.name
#>  1:    input.names                  FALSE                        FALSE
#>  2:    input.names                  FALSE                        FALSE
#>  3:    input.names                  FALSE                        FALSE
#>  4:    input.names                  FALSE                        FALSE
#>  5:    input.names                  FALSE                        FALSE
#>  6:    input.names                     NA                        FALSE
#>  7: input.patterns                  FALSE                        FALSE
#>  8: input.patterns                  FALSE                        FALSE
#>  9: input.patterns                  FALSE                        FALSE
#> 10: input.patterns                  FALSE                        FALSE
#> 11: input.patterns                  FALSE                        FALSE
#> 12: input.patterns                  FALSE                        FALSE
#> 13: input.patterns                  FALSE                        FALSE
#> 14: input.patterns                  FALSE                        FALSE
#> 15: input.patterns                  FALSE                        FALSE
#> 16: input.patterns                  FALSE                        FALSE
#> 17: input.patterns                  FALSE                        FALSE
#>     include.variable
#>  1:             TRUE
#>  2:             TRUE
#>  3:             TRUE
#>  4:             TRUE
#>  5:             TRUE
#>  6:            FALSE
#>  7:             TRUE
#>  8:             TRUE
#>  9:             TRUE
#> 10:             TRUE
#> 11:             TRUE
#> 12:             TRUE
#> 13:             TRUE
#> 14:             TRUE
#> 15:             TRUE
#> 16:             TRUE
#> 17:             TRUE
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

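In this example, Age, Gender, Income, Region, and Persona were directly specified by the user through input.names. Typo was also specified but was excluded because it is not a column of snack.dat. The first pattern, "Gend", matches Gender, which was already included and therefore appears only once. All of the brand perceptions were then selected based on the second pattern, "BP_".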
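
The pattern search itself can be sketched with grep in base R. This is a simplified illustration, not the package's implementation; the vector column.names is a hypothetical stand-in for names(dat), and the sketch assumes that directly specified names come first, followed by pattern matches in the order of the patterns.

```r
# Hypothetical column names standing in for names(dat).
column.names <- c("Age", "Gender", "Region",
                  "BP_For_Me_0_10", "BP_Healthy_0_10", "Awareness")

outcome.name <- "Awareness"
input.names <- c("Age", "Gender")
input.patterns <- c("Gend", "BP_")

# Collect every column name that matches at least one pattern.
matched <- unlist(lapply(
  input.patterns,
  function(p) grep(pattern = p, x = column.names, value = TRUE)
))

# Combine with the directly specified names, drop duplicates,
# and make sure the outcome is never used as an input.
all.inputs <- setdiff(unique(c(input.names, matched)), outcome.name)
print(all.inputs)
```

Gender is matched by the "Gend" pattern but appears only once, because it was already listed in input.names.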

Selecting All of the Variables

In R, a formula such as y ~ . uses every other column of the data as an input. The create.formula function maintains this capability when "." is included in the input.names and a data set is provided:

dot.form.1 <-
  create.formula(outcome.name = outcome.name.awareness,
                 input.names = ".",
                 dat = snack.dat)

print(dot.form.1)
#> $formula
#> Awareness ~ `User ID` + Age + Gender + Income + Region + Persona + 
#>     Product + BP_For_Me_0_10 + BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + 
#>     BP_Good_To_Share_0_10 + BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + 
#>     BP_Everyday_Snack_0_10 + BP_Healthy_0_10 + BP_Delicious_0_10 + 
#>     BP_Right_Amount_0_10 + BP_Relaxing_0_10 + Consideration + 
#>     Consumption + Satisfaction + Advocacy + `Age Group` + `Income Group`
#> <environment: 0x0000000020ec02d8>
#> 
#> $inclusion.table
#>                      variable exclude.null.quantity   class order
#>  1:                   User ID                 FALSE  factor     1
#>  2:                       Age                 FALSE integer     2
#>  3:                    Gender                 FALSE  factor     3
#>  4:                    Income                 FALSE numeric     4
#>  5:                    Region                 FALSE  factor     5
#>  6:                   Persona                 FALSE  factor     6
#>  7:                   Product                 FALSE  factor     7
#>  8:                 Awareness                 FALSE integer     8
#>  9:            BP_For_Me_0_10                 FALSE integer     9
#> 10:       BP_Fits_Budget_0_10                 FALSE integer    10
#> 11:      BP_Tastes_Great_0_10                 FALSE integer    11
#> 12:     BP_Good_To_Share_0_10                 FALSE integer    12
#> 13:         BP_Like_Logo_0_10                 FALSE integer    13
#> 14: BP_Special_Occasions_0_10                 FALSE integer    14
#> 15:    BP_Everyday_Snack_0_10                 FALSE integer    15
#> 16:           BP_Healthy_0_10                 FALSE integer    16
#> 17:         BP_Delicious_0_10                 FALSE integer    17
#> 18:      BP_Right_Amount_0_10                 FALSE integer    18
#> 19:          BP_Relaxing_0_10                 FALSE integer    19
#> 20:             Consideration                 FALSE integer    20
#> 21:               Consumption                 FALSE integer    21
#> 22:              Satisfaction                 FALSE integer    22
#> 23:                  Advocacy                 FALSE integer    23
#> 24:                 Age Group                 FALSE  factor    24
#> 25:              Income Group                 FALSE  factor    25
#>                      variable exclude.null.quantity   class order
#>     specified.from exclude.user.specified exclude.matches.outcome.name
#>  1:    input.names                  FALSE                        FALSE
#>  2:    input.names                  FALSE                        FALSE
#>  3:    input.names                  FALSE                        FALSE
#>  4:    input.names                  FALSE                        FALSE
#>  5:    input.names                  FALSE                        FALSE
#>  6:    input.names                  FALSE                        FALSE
#>  7:    input.names                  FALSE                        FALSE
#>  8:    input.names                  FALSE                         TRUE
#>  9:    input.names                  FALSE                        FALSE
#> 10:    input.names                  FALSE                        FALSE
#> 11:    input.names                  FALSE                        FALSE
#> 12:    input.names                  FALSE                        FALSE
#> 13:    input.names                  FALSE                        FALSE
#> 14:    input.names                  FALSE                        FALSE
#> 15:    input.names                  FALSE                        FALSE
#> 16:    input.names                  FALSE                        FALSE
#> 17:    input.names                  FALSE                        FALSE
#> 18:    input.names                  FALSE                        FALSE
#> 19:    input.names                  FALSE                        FALSE
#> 20:    input.names                  FALSE                        FALSE
#> 21:    input.names                  FALSE                        FALSE
#> 22:    input.names                  FALSE                        FALSE
#> 23:    input.names                  FALSE                        FALSE
#> 24:    input.names                  FALSE                        FALSE
#> 25:    input.names                  FALSE                        FALSE
#>     specified.from exclude.user.specified exclude.matches.outcome.name
#>     include.variable
#>  1:             TRUE
#>  2:             TRUE
#>  3:             TRUE
#>  4:             TRUE
#>  5:             TRUE
#>  6:             TRUE
#>  7:             TRUE
#>  8:            FALSE
#>  9:             TRUE
#> 10:             TRUE
#> 11:             TRUE
#> 12:             TRUE
#> 13:             TRUE
#> 14:             TRUE
#> 15:             TRUE
#> 16:             TRUE
#> 17:             TRUE
#> 18:             TRUE
#> 19:             TRUE
#> 20:             TRUE
#> 21:             TRUE
#> 22:             TRUE
#> 23:             TRUE
#> 24:             TRUE
#> 25:             TRUE
#>     include.variable
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

It is not necessary, but a user may also name a variable explicitly alongside the dot, as the following example demonstrates. create.formula handles the duplicated variable, here “Gender”, and incorporates the variables that pass the quality checks:


input.names = c("Gender", ".")

dot.form.2 <- create.formula(outcome.name = outcome.name.awareness, input.names = input.names, dat = snack.dat)

print(dot.form.2)
#> $formula
#> Awareness ~ Gender + `User ID` + Age + Income + Region + Persona + 
#>     Product + BP_For_Me_0_10 + BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + 
#>     BP_Good_To_Share_0_10 + BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + 
#>     BP_Everyday_Snack_0_10 + BP_Healthy_0_10 + BP_Delicious_0_10 + 
#>     BP_Right_Amount_0_10 + BP_Relaxing_0_10 + Consideration + 
#>     Consumption + Satisfaction + Advocacy + `Age Group` + `Income Group`
#> <environment: 0x0000000020a26160>
#> 
#> $inclusion.table
#>                      variable exclude.null.quantity   class order
#>  1:                    Gender                 FALSE  factor     1
#>  2:                   User ID                 FALSE  factor     2
#>  3:                       Age                 FALSE integer     3
#>  4:                    Income                 FALSE numeric     4
#>  5:                    Region                 FALSE  factor     5
#>  6:                   Persona                 FALSE  factor     6
#>  7:                   Product                 FALSE  factor     7
#>  8:                 Awareness                 FALSE integer     8
#>  9:            BP_For_Me_0_10                 FALSE integer     9
#> 10:       BP_Fits_Budget_0_10                 FALSE integer    10
#> 11:      BP_Tastes_Great_0_10                 FALSE integer    11
#> 12:     BP_Good_To_Share_0_10                 FALSE integer    12
#> 13:         BP_Like_Logo_0_10                 FALSE integer    13
#> 14: BP_Special_Occasions_0_10                 FALSE integer    14
#> 15:    BP_Everyday_Snack_0_10                 FALSE integer    15
#> 16:           BP_Healthy_0_10                 FALSE integer    16
#> 17:         BP_Delicious_0_10                 FALSE integer    17
#> 18:      BP_Right_Amount_0_10                 FALSE integer    18
#> 19:          BP_Relaxing_0_10                 FALSE integer    19
#> 20:             Consideration                 FALSE integer    20
#> 21:               Consumption                 FALSE integer    21
#> 22:              Satisfaction                 FALSE integer    22
#> 23:                  Advocacy                 FALSE integer    23
#> 24:                 Age Group                 FALSE  factor    24
#> 25:              Income Group                 FALSE  factor    25
#>                      variable exclude.null.quantity   class order
#>     specified.from exclude.user.specified exclude.matches.outcome.name
#>  1:    input.names                  FALSE                        FALSE
#>  2:    input.names                  FALSE                        FALSE
#>  3:    input.names                  FALSE                        FALSE
#>  4:    input.names                  FALSE                        FALSE
#>  5:    input.names                  FALSE                        FALSE
#>  6:    input.names                  FALSE                        FALSE
#>  7:    input.names                  FALSE                        FALSE
#>  8:    input.names                  FALSE                         TRUE
#>  9:    input.names                  FALSE                        FALSE
#> 10:    input.names                  FALSE                        FALSE
#> 11:    input.names                  FALSE                        FALSE
#> 12:    input.names                  FALSE                        FALSE
#> 13:    input.names                  FALSE                        FALSE
#> 14:    input.names                  FALSE                        FALSE
#> 15:    input.names                  FALSE                        FALSE
#> 16:    input.names                  FALSE                        FALSE
#> 17:    input.names                  FALSE                        FALSE
#> 18:    input.names                  FALSE                        FALSE
#> 19:    input.names                  FALSE                        FALSE
#> 20:    input.names                  FALSE                        FALSE
#> 21:    input.names                  FALSE                        FALSE
#> 22:    input.names                  FALSE                        FALSE
#> 23:    input.names                  FALSE                        FALSE
#> 24:    input.names                  FALSE                        FALSE
#> 25:    input.names                  FALSE                        FALSE
#>     specified.from exclude.user.specified exclude.matches.outcome.name
#>     include.variable
#>  1:             TRUE
#>  2:             TRUE
#>  3:             TRUE
#>  4:             TRUE
#>  5:             TRUE
#>  6:             TRUE
#>  7:             TRUE
#>  8:            FALSE
#>  9:             TRUE
#> 10:             TRUE
#> 11:             TRUE
#> 12:             TRUE
#> 13:             TRUE
#> 14:             TRUE
#> 15:             TRUE
#> 16:             TRUE
#> 17:             TRUE
#> 18:             TRUE
#> 19:             TRUE
#> 20:             TRUE
#> 21:             TRUE
#> 22:             TRUE
#> 23:             TRUE
#> 24:             TRUE
#> 25:             TRUE
#>     include.variable
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

Likewise, if a user adds a variable whose misspelled name does not match any column of the data set, as the following example shows, create.formula drops the misspelled variable, here “Typo”, and incorporates the variables that pass the quality checks:


input.names = c("Typo", ".")

dot.form.2 <- create.formula(outcome.name = outcome.name.awareness, input.names = input.names, dat = snack.dat)

print(dot.form.2)
#> $formula
#> Awareness ~ `User ID` + Age + Gender + Income + Region + Persona + 
#>     Product + BP_For_Me_0_10 + BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + 
#>     BP_Good_To_Share_0_10 + BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + 
#>     BP_Everyday_Snack_0_10 + BP_Healthy_0_10 + BP_Delicious_0_10 + 
#>     BP_Right_Amount_0_10 + BP_Relaxing_0_10 + Consideration + 
#>     Consumption + Satisfaction + Advocacy + `Age Group` + `Income Group`
#> <environment: 0x0000000020c946e8>
#> 
#> $inclusion.table
#>                      variable exclude.null.quantity   class order
#>  1:                      Typo                  TRUE    <NA>     1
#>  2:                   User ID                 FALSE  factor     2
#>  3:                       Age                 FALSE integer     3
#>  4:                    Gender                 FALSE  factor     4
#>  5:                    Income                 FALSE numeric     5
#>  6:                    Region                 FALSE  factor     6
#>  7:                   Persona                 FALSE  factor     7
#>  8:                   Product                 FALSE  factor     8
#>  9:                 Awareness                 FALSE integer     9
#> 10:            BP_For_Me_0_10                 FALSE integer    10
#> 11:       BP_Fits_Budget_0_10                 FALSE integer    11
#> 12:      BP_Tastes_Great_0_10                 FALSE integer    12
#> 13:     BP_Good_To_Share_0_10                 FALSE integer    13
#> 14:         BP_Like_Logo_0_10                 FALSE integer    14
#> 15: BP_Special_Occasions_0_10                 FALSE integer    15
#> 16:    BP_Everyday_Snack_0_10                 FALSE integer    16
#> 17:           BP_Healthy_0_10                 FALSE integer    17
#> 18:         BP_Delicious_0_10                 FALSE integer    18
#> 19:      BP_Right_Amount_0_10                 FALSE integer    19
#> 20:          BP_Relaxing_0_10                 FALSE integer    20
#> 21:             Consideration                 FALSE integer    21
#> 22:               Consumption                 FALSE integer    22
#> 23:              Satisfaction                 FALSE integer    23
#> 24:                  Advocacy                 FALSE integer    24
#> 25:                 Age Group                 FALSE  factor    25
#> 26:              Income Group                 FALSE  factor    26
#>                      variable exclude.null.quantity   class order
#>     specified.from exclude.user.specified exclude.matches.outcome.name
#>  1:    input.names                     NA                        FALSE
#>  2:    input.names                  FALSE                        FALSE
#>  3:    input.names                  FALSE                        FALSE
#>  4:    input.names                  FALSE                        FALSE
#>  5:    input.names                  FALSE                        FALSE
#>  6:    input.names                  FALSE                        FALSE
#>  7:    input.names                  FALSE                        FALSE
#>  8:    input.names                  FALSE                        FALSE
#>  9:    input.names                  FALSE                         TRUE
#> 10:    input.names                  FALSE                        FALSE
#> 11:    input.names                  FALSE                        FALSE
#> 12:    input.names                  FALSE                        FALSE
#> 13:    input.names                  FALSE                        FALSE
#> 14:    input.names                  FALSE                        FALSE
#> 15:    input.names                  FALSE                        FALSE
#> 16:    input.names                  FALSE                        FALSE
#> 17:    input.names                  FALSE                        FALSE
#> 18:    input.names                  FALSE                        FALSE
#> 19:    input.names                  FALSE                        FALSE
#> 20:    input.names                  FALSE                        FALSE
#> 21:    input.names                  FALSE                        FALSE
#> 22:    input.names                  FALSE                        FALSE
#> 23:    input.names                  FALSE                        FALSE
#> 24:    input.names                  FALSE                        FALSE
#> 25:    input.names                  FALSE                        FALSE
#> 26:    input.names                  FALSE                        FALSE
#>     specified.from exclude.user.specified exclude.matches.outcome.name
#>     include.variable
#>  1:            FALSE
#>  2:             TRUE
#>  3:             TRUE
#>  4:             TRUE
#>  5:             TRUE
#>  6:             TRUE
#>  7:             TRUE
#>  8:             TRUE
#>  9:            FALSE
#> 10:             TRUE
#> 11:             TRUE
#> 12:             TRUE
#> 13:             TRUE
#> 14:             TRUE
#> 15:             TRUE
#> 16:             TRUE
#> 17:             TRUE
#> 18:             TRUE
#> 19:             TRUE
#> 20:             TRUE
#> 21:             TRUE
#> 22:             TRUE
#> 23:             TRUE
#> 24:             TRUE
#> 25:             TRUE
#> 26:             TRUE
#>     include.variable
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

Removing Specific Variables

With multiple ways to specify the variables to include in a formula, it can also be helpful to ensure that a specific variable is not included. As an example, when using input.patterns to include all of the brand perception variables, we can remove BP_Delicious_0_10 and Gender by specifying the variables.to.exclude parameter. This parameter supersedes any variables named in input.names, matched by input.patterns, or used in interactions:

input.names <-
  c("Age",
    "Gender",
    "Income",
    "Region",
    "Persona",
    "Typo",
    "Age Group")
interactions <-
  list(
    c("Age", "Gender"),
    c("Age", "Income"),
    c("Age", "Gender", "Income"),
    c("Gender", "Inco"),
    c("Age", "Reg ion")
  )
bp.pattern = "BP_"
variables.to.exclude = c("BP_Delicious_0_10", "Gender")

variables.to.exclude.form <-
  create.formula(
    outcome.name = outcome.name.awareness,
    input.names = input.names,
    interactions = interactions,
    input.patterns = bp.pattern,
    variables.to.exclude = variables.to.exclude,
    dat = snack.dat
  )


print(variables.to.exclude.form)
#> $formula
#> Awareness ~ Age + Income + Region + Persona + `Age Group` + BP_For_Me_0_10 + 
#>     BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + BP_Good_To_Share_0_10 + 
#>     BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + BP_Everyday_Snack_0_10 + 
#>     BP_Healthy_0_10 + BP_Right_Amount_0_10 + BP_Relaxing_0_10 + 
#>     Age * Income
#> <environment: 0x000000002102a850>
#> 
#> $inclusion.table
#>                      variable exclude.null.quantity   class order
#>  1:                       Age                 FALSE integer     1
#>  2:                    Gender                 FALSE  factor     2
#>  3:                    Income                 FALSE numeric     3
#>  4:                    Region                 FALSE  factor     4
#>  5:                   Persona                 FALSE  factor     5
#>  6:                      Typo                  TRUE    <NA>     6
#>  7:                 Age Group                 FALSE  factor     7
#>  8:            BP_For_Me_0_10                 FALSE integer     8
#>  9:       BP_Fits_Budget_0_10                 FALSE integer     9
#> 10:      BP_Tastes_Great_0_10                 FALSE integer    10
#> 11:     BP_Good_To_Share_0_10                 FALSE integer    11
#> 12:         BP_Like_Logo_0_10                 FALSE integer    12
#> 13: BP_Special_Occasions_0_10                 FALSE integer    13
#> 14:    BP_Everyday_Snack_0_10                 FALSE integer    14
#> 15:           BP_Healthy_0_10                 FALSE integer    15
#> 16:         BP_Delicious_0_10                 FALSE integer    16
#> 17:      BP_Right_Amount_0_10                 FALSE integer    17
#> 18:          BP_Relaxing_0_10                 FALSE integer    18
#> 19:                      Inco                  TRUE    <NA>    19
#> 20:                   Reg ion                  TRUE    <NA>    20
#>     specified.from exclude.user.specified exclude.matches.outcome.name
#>  1:    input.names                  FALSE                        FALSE
#>  2:    input.names                   TRUE                        FALSE
#>  3:    input.names                  FALSE                        FALSE
#>  4:    input.names                  FALSE                        FALSE
#>  5:    input.names                  FALSE                        FALSE
#>  6:    input.names                     NA                        FALSE
#>  7:    input.names                  FALSE                        FALSE
#>  8: input.patterns                  FALSE                        FALSE
#>  9: input.patterns                  FALSE                        FALSE
#> 10: input.patterns                  FALSE                        FALSE
#> 11: input.patterns                  FALSE                        FALSE
#> 12: input.patterns                  FALSE                        FALSE
#> 13: input.patterns                  FALSE                        FALSE
#> 14: input.patterns                  FALSE                        FALSE
#> 15: input.patterns                  FALSE                        FALSE
#> 16: input.patterns                   TRUE                        FALSE
#> 17: input.patterns                  FALSE                        FALSE
#> 18: input.patterns                  FALSE                        FALSE
#> 19:   interactions                     NA                        FALSE
#> 20:   interactions                     NA                        FALSE
#>     include.variable
#>  1:             TRUE
#>  2:            FALSE
#>  3:             TRUE
#>  4:             TRUE
#>  5:             TRUE
#>  6:            FALSE
#>  7:             TRUE
#>  8:             TRUE
#>  9:             TRUE
#> 10:             TRUE
#> 11:             TRUE
#> 12:             TRUE
#> 13:             TRUE
#> 14:             TRUE
#> 15:             TRUE
#> 16:            FALSE
#> 17:             TRUE
#> 18:             TRUE
#> 19:            FALSE
#> 20:            FALSE
#> 
#> $interactions.table
#>             interactions include.interaction
#> 1:          Age * Gender               FALSE
#> 2:          Age * Income                TRUE
#> 3: Age * Gender * Income               FALSE
#> 4:         Gender * Inco               FALSE
#> 5:         Age * Reg ion               FALSE

Quality Checks

With the create.formula function, the formulaic package provides a range of quality checks that examine the design of a formula. The degree of quality checking can be controlled by the user at several levels. When the user specifies that quality checks should be performed, create.formula builds objects called inclusion.table and interactions.table, which form a portion of the function’s output. The inclusion.table object is a data.frame that reports on each variable that was considered for inclusion in the final list of inputs. It contains one column for each quality check, each indicating whether a variable should be excluded on those grounds. Once all of the specified quality checks have been performed, the include.variable column is computed as an overall indicator of whether the specified variable should be included as an input in the formula object.

The interactions.table follows a similar logic. An interaction will be excluded if any of the variables in its components was excluded based on the quality checks in the inclusion.table.
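The rule for interactions can be illustrated with a small base R sketch. The tables here are hand-made stand-ins with hypothetical values, not output from create.formula, and the code only mirrors the rule as described above rather than the package's internal implementation:

```r
# Toy stand-in for an inclusion table (hypothetical values, not package output).
inclusion <- data.frame(
  variable = c("Age", "Gender", "Income"),
  include.variable = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Candidate interactions, each given as a vector of component variables.
interactions <- list(c("Age", "Gender"), c("Age", "Income"))

# Variables that passed every quality check.
included <- inclusion$variable[inclusion$include.variable]

# An interaction is kept only if all of its components were kept.
interactions.check <- data.frame(
  interactions = vapply(interactions, paste, character(1), collapse = " * "),
  include.interaction = vapply(interactions,
                               function(v) all(v %in% included),
                               logical(1)),
  stringsAsFactors = FALSE
)

print(interactions.check)
```

In this toy case, Age * Gender is dropped because Gender was excluded, while Age * Income is retained.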

Outcomes as Inputs

Most formula objects would not include the outcome variable as an input. However, when such a formula is constructed, whether by mistake or intentionally, common models are inconsistent in how they handle the situation. Consider, for instance, Income ~ Age + Income. The create.formula function automatically drops the outcome variable from the inputs and returns the formula Income ~ Age:

input.names <- c("Income", "Age", "Income")
income.name = "Income"

outcomes.as.inputs.form <-
  create.formula(outcome.name = income.name,
                 input.names = input.names,
                 dat = snack.dat)

print(outcomes.as.inputs.form)
#> $formula
#> Income ~ Age
#> <environment: 0x0000000020990138>
#> 
#> $inclusion.table
#>    variable exclude.null.quantity   class order specified.from
#> 1:   Income                 FALSE numeric     1    input.names
#> 2:      Age                 FALSE integer     2    input.names
#>    exclude.user.specified exclude.matches.outcome.name include.variable
#> 1:                  FALSE                         TRUE            FALSE
#> 2:                  FALSE                        FALSE             TRUE
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

Removing duplicated variables

Users may accidentally or intentionally list the same variable more than once among the inputs or interactions; we call these duplicated variables. The create.formula function builds the formula from the unique variable names and unique interactions:

duplicated.inputs <- c(rep.int(x = "Age", times = 2), "Income")
duplicated.interactions <-
  list(c("Age", "Income"), c("Age", "Income"))

duplicated.form <-
  create.formula(
    outcome.name = outcome.name.awareness,
    input.names = duplicated.inputs,
    interactions = duplicated.interactions,
    dat = snack.dat
  )

print(duplicated.form)
#> $formula
#> Awareness ~ Age + Income + Age * Income
#> <environment: 0x000000001729c8a8>
#> 
#> $inclusion.table
#>    variable exclude.null.quantity   class order specified.from
#> 1:      Age                 FALSE integer     1    input.names
#> 2:   Income                 FALSE numeric     2    input.names
#>    exclude.user.specified exclude.matches.outcome.name include.variable
#> 1:                  FALSE                        FALSE             TRUE
#> 2:                  FALSE                        FALSE             TRUE
#> 
#> $interactions.table
#>    interactions include.interaction
#> 1: Age * Income                TRUE
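
For readers who want to see the idea outside of the package, the two clean-up steps above (dropping the outcome from the inputs and removing duplicates) can be approximated in base R. This is a minimal sketch with made-up input names; it is not how create.formula is implemented, and it performs none of the other quality checks:

```r
# Hypothetical inputs containing the outcome and a duplicated variable.
outcome.name <- "Income"
input.names <- c("Income", "Age", "Age", "Region")

# unique() removes repeats; setdiff() removes the outcome from the inputs.
cleaned <- setdiff(unique(input.names), outcome.name)

# reformulate() assembles a formula from character vectors.
sketch.formula <- reformulate(termlabels = cleaned, response = outcome.name)

print(cleaned)
print(sketch.formula)
# formula: Income ~ Age + Region
```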

Misspecified Variables

A formula object in R can only be supplied to a model when all of its terms directly match the names of the data.frame object on which the model will be fit. Misspecified variables within a formula, such as those arising from typographical errors, will typically lead to error messages in R’s implementation of a model. The formulaic package provides the option to either a) maintain this effect or b) automatically remove any misspecified variables. When a user supplies a dataset to the create.formula function, the variables intended for the formula receive a quality check to ensure that they match a corresponding name within the associated data.frame. Misspecified variables will be marked in the inclusion.table portion of the output of the create.formula function. Any misspecified variables or associated interactions will be removed from the formula in this setting.

input.names <- c("Age", "Typo")
income.name <- "Income"

formula.with.typo <-
  create.formula(outcome.name = income.name, input.names = input.names)
print(formula.with.typo)
#> $formula
#> Income ~ Age + Typo
#> <environment: 0x000000001efd4c78>
#> 
#> $inclusion.table
#> Null data.table (0 rows and 0 cols)
#> 
#> $interactions.table
#> Null data.table (0 rows and 0 cols)

formula.without.typo <-
  create.formula(outcome.name = income.name,
                 input.names = input.names,
                 dat = snack.dat)
print(formula.without.typo)
#> $formula
#> Income ~ Age
#> <environment: 0x000000001f1fb228>
#> 
#> $inclusion.table
#>    variable exclude.null.quantity   class order specified.from
#> 1:      Age                 FALSE integer     1    input.names
#> 2:     Typo                  TRUE    <NA>     2    input.names
#>    exclude.user.specified exclude.matches.outcome.name include.variable
#> 1:                  FALSE                        FALSE             TRUE
#> 2:                     NA                        FALSE            FALSE
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction
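
The core of this check is a comparison of the requested names against the column names of the data. The following base R sketch shows that comparison on a small made-up data.frame; the object names (toy.dat, name.check) are hypothetical and the code is only an approximation of the behavior described above:

```r
# Small made-up data set standing in for a real one.
toy.dat <- data.frame(Age = c(23, 41, 35), Income = c(50, 72, 64))

input.names <- c("Age", "Typo")

# TRUE where the requested name matches a column of the data.
is.present <- input.names %in% names(toy.dat)

name.check <- data.frame(
  variable = input.names,
  found.in.data = is.present,
  stringsAsFactors = FALSE
)
print(name.check)

# Only names that match a column are carried into the formula.
valid.inputs <- input.names[is.present]
print(reformulate(termlabels = valid.inputs, response = "Income"))
```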

Considerations for Feature Engineering

Selecting appropriate variables for a statistical model can include challenges associated with the domain, methodology, computational considerations, and practical limitations of the data. Some variables may not be suitable for inclusion based on either a lack of contrast or a large number of categories. This section will explore these problems in greater detail. In doing so, we will demonstrate how the formulaic package can automatically identify and handle these issues.

A Lack of Contrast

Statistical models typically estimate the relationship between outcome and the inputs based upon the impact of changes in the inputs. When a variable is constant across all of its measured values, its variance is zero, and therefore the variable’s correlation with another variable is undefined. A constant input variable therefore exhibits a lack of contrast with regard to estimating its impact on an outcome. Many statistical models in R will return error messages when an input is a constant variable or consists only of missing data. If a large number of variables are included, then each error message will only identify the first such variable. An iterative process may be required to remove variables with a lack of contrast. Furthermore, even in variables that exhibit variation across the full range of the data, a lack of contrast may yet arise when a model is fit on a subset of these data.

Numeric Variables With No Variation

snack.dat[, .N, keyby = c("Awareness", "Consideration")]
#>    Awareness Consideration     N
#> 1:         0            NA 10906
#> 2:         1             0  6907
#> 3:         1             1  5187

As the counts above show, consideration is only measured for respondents whose awareness is equal to 1. A model of consideration, estimated on the rows for which this outcome is measured, would therefore only include values of 1 for the respondents’ awareness. A logistic regression that includes awareness as an input would therefore generate a missing value for the coefficient of awareness:

formula.consideration <-
  create.formula(outcome.name = consideration.name,
                 input.names = c(age.name, awareness.name))

print(formula.consideration$formula)
#> Consideration ~ Age + Awareness
#> <environment: 0x000000001fcab108>

glm(formula = formula.consideration$formula,
    data = snack.dat,
    family = "binomial")$coefficients
#>   (Intercept)           Age     Awareness 
#> -0.2556379284 -0.0005580826            NA

Because the awareness variable lacks variation in this subset, it is not suitable for use as a predictor of consideration. (It should instead be viewed as a prerequisite.) This matter can be resolved through the use of the reduce parameter in the create.formula function. When reduce = TRUE and a dataset is provided for inspection, create.formula automatically performs quality checks on all of the potential input variables. Any variables with a lack of variation will be identified and proactively excluded from the formula. Meanwhile, a record of this inspection is provided in the inclusion.table’s output:

formula.consideration <-
  create.formula(
    outcome.name = consideration.name,
    input.names = c(age.name, awareness.name),
    dat = snack.dat,
    reduce = TRUE
  )

print(formula.consideration)
#> $formula
#> Consideration ~ Age
#> <environment: 0x000000002017b428>
#> 
#> $inclusion.table
#>     variable exclude.null.quantity   class order specified.from
#> 1:       Age                 FALSE integer     1    input.names
#> 2: Awareness                 FALSE integer     2    input.names
#>    exclude.user.specified exclude.matches.outcome.name min.categories
#> 1:                  FALSE                        FALSE             75
#> 2:                  FALSE                        FALSE              1
#>    exclude.lack.contrast exclude.numerous.categories include.variable
#> 1:                 FALSE                       FALSE             TRUE
#> 2:                  TRUE                       FALSE            FALSE
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction
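
A check of this kind can be approximated by counting the distinct non-missing values in each candidate input. The sketch below uses a made-up data.frame and hypothetical object names; it illustrates the idea of a lack of contrast and should not be read as the package's own implementation of reduce = TRUE:

```r
# Made-up data: one varying input, one constant input, one all-missing input.
toy.dat <- data.frame(
  Age = c(25, 31, 47, 52),
  Awareness = c(1, 1, 1, 1),
  AllMissing = c(NA, NA, NA, NA)
)

# Number of distinct non-missing values in each column.
n.unique <- vapply(toy.dat,
                   function(x) length(unique(x[!is.na(x)])),
                   integer(1))

# Fewer than two distinct values means there is nothing to contrast.
lacks.contrast <- n.unique < 2

print(data.frame(variable = names(n.unique),
                 n.unique = unname(n.unique),
                 lacks.contrast = unname(lacks.contrast)))
```

Here Age has four distinct values and would be kept, while Awareness (one value) and AllMissing (zero values) would be flagged.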

Categorical Variables With No Variation

To incorporate categorical variables with k > 1 different measured values, statistical models typically code separate columns of indicator variables across k-1 categories, while the kth category serves as a reference. Without variation (k <= 1), this procedure cannot code any indicator variables. Without a meaningful way to include such a variable as an input, the model will instead generate an error message.
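
The indicator coding described above can be seen directly with base R's model.matrix. This is a small illustration on made-up data (the names toy and single are hypothetical); the second part captures the error that arises once only one level remains:

```r
# A factor with k = 3 levels yields k - 1 = 2 indicator columns plus an intercept.
toy <- data.frame(group = factor(c("a", "b", "c", "a")))
design <- model.matrix(~ group, data = toy)
print(colnames(design))
# "(Intercept)" "groupb" "groupc"

# Keep only the rows in level "a" and drop the unused levels, leaving k = 1.
single <- droplevels(toy[toy$group == "a", , drop = FALSE])
print(nlevels(single$group))

# With a single level no contrasts can be coded, so model.matrix() stops.
# tryCatch() captures the error message instead of halting the script.
result <- tryCatch(
  model.matrix(~ group, data = single),
  error = function(e) conditionMessage(e)
)
print(result)
```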

As an example, consider a model generated on the subset of respondents between the ages of 18 and 35 years old. This represents one category of the possible age groups. If a logistic regression model nonetheless attempted to include the age group as an input, this would lead to the following result:

formula.awareness <-
  create.formula(outcome.name = awareness.name,
                 input.names = c(age.group.name, gender.name))

print(formula.awareness$formula)
#> Awareness ~ `Age Group` + Gender
#> <environment: 0x000000001ee2e688>

# Not run: this call stops with an error, because `Age Group` has only one
# category within the subset.
# glm(formula = formula.awareness$formula,
#     data = snack.dat[get(age.group.name) == "[ 18, 35)", ],
#     family = "binomial")

This particular example is designed to demonstrate the issue with a simple contradiction, and its root cause is easy to identify. In real applications, significant investigation may be required to determine which variables may be causing such an effect. The error message provided informs the user of a lack of contrast, but it does not identify which variable is causing the issue. In a formula that incorporates many inputs, there may be a number of different variables that each contribute to the issue.

Within the formulaic package, create.formula’s reduce parameter can be used to automatically identify categorical variables with a lack of contrast. When reduce = TRUE and a data set is provided, inputs with no variation are excluded from the resulting formula. The exclude.lack.contrast column of the output’s inclusion.table identifies which variables exhibit a lack of contrast, and the min.categories column reports the number of unique values for each variable. This is demonstrated with the call to create.formula below:



formula.awareness <-
  create.formula(
    outcome.name = awareness.name,
    input.names = c(age.group.name, gender.name),
    dat = snack.dat[get(age.group.name) == "[ 18, 35)", ],
    reduce = TRUE
  )

print(formula.awareness)

A Lack of Contrast within Subsets of the Data

Because of the sequence of survey questions behind snack.dat, many of the measured variables for a brand are recorded downstream from the initial question about the respondent’s awareness. These questions are only asked of the respondents who indicate awareness. As shown previously, the values of consideration (1 or 0) only occur when awareness is equal to 1. Across the full range of the data, the consideration variable includes multiple values and exhibits variation. However, within the subgroup of respondents who are not aware of the specific product, all of the values are missing. Due to this structurally missing design, it can be necessary to search for a lack of contrast within subsets defined by the outcome variable. The create.formula function allows users to specify the max.outcome.categories.to.search parameter. When the number of unique values of the outcome is less than or equal to the value of max.outcome.categories.to.search, a data set is provided, and reduce = TRUE, the search for a lack of contrast is extended into the subsets based on the outcome variable.
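
The subset search can be sketched in base R as follows. The data are made up to mimic the structurally missing pattern, the helper count.unique is hypothetical, and the logic reflects only the description in the preceding paragraph, not the package's source code:

```r
# Made-up data: Consideration is missing whenever Awareness is 0.
toy.dat <- data.frame(
  Consideration = c(NA, NA, 0, 1, 1, 0),
  Awareness = c(0, 0, 1, 1, 1, 1)
)

# Hypothetical helper: number of distinct non-missing values.
count.unique <- function(x) length(unique(x[!is.na(x)]))

max.outcome.categories.to.search <- 2

# Global check: Awareness takes two values across all rows.
global.count <- count.unique(toy.dat$Awareness)

# Extend the search into outcome subsets only when the outcome has few values.
min.categories <- global.count
if (count.unique(toy.dat$Consideration) <= max.outcome.categories.to.search) {
  # split() drops the rows where the outcome is missing.
  subsets <- split(toy.dat, toy.dat$Consideration)
  subset.counts <- vapply(subsets,
                          function(d) count.unique(d$Awareness),
                          integer(1))
  min.categories <- min(global.count, subset.counts)
}

print(global.count)
print(min.categories)
```

In this toy case the global count is 2 but the minimum across the outcome subsets is 1, which is the situation in which an input would be flagged for a lack of contrast.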

As an example, consider a model of consideration that attempts to utilize awareness as an input. The consideration outcome has two unique measured values (1 and 0). If max.outcome.categories.to.search = 1, then the subgroups of consideration will not be searched for a lack of contrast. Instead, the only quality check related to variation will examine each variable for a global lack of contrast. Awareness exhibits variation at the global level, since it takes both of its binary values (0 and 1) across the full data. This selection is depicted below:

formula.consideration.1 <-
  create.formula(
    outcome.name = consideration.name,
    input.names = c(age.group.name, gender.name, awareness.name),
    dat = snack.dat,
    reduce = TRUE,
    max.outcome.categories.to.search = 1
  )

print(formula.consideration.1)
#> $formula
#> Consideration ~ `Age Group` + Gender + Awareness
#> <environment: 0x000000001f118c48>
#> 
#> $inclusion.table
#>     variable exclude.null.quantity   class order specified.from
#> 1: Age Group                 FALSE  factor     1    input.names
#> 2:    Gender                 FALSE  factor     2    input.names
#> 3: Awareness                 FALSE integer     3    input.names
#>    exclude.user.specified exclude.matches.outcome.name min.categories
#> 1:                  FALSE                        FALSE              4
#> 2:                  FALSE                        FALSE              2
#> 3:                  FALSE                        FALSE              2
#>    exclude.lack.contrast exclude.numerous.categories include.variable
#> 1:                 FALSE                       FALSE             TRUE
#> 2:                 FALSE                       FALSE             TRUE
#> 3:                 FALSE                       FALSE             TRUE
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction

However, if max.outcome.categories.to.search >= 2, then the consideration variable would qualify as having sufficiently few unique values. Then each subset would subsequently be searched for a lack of contrast in each of the possible inputs. When consideration is 1 or 0, the awareness variable is always 1. Therefore, the inclusion.table’s calculation of the min.categories will be reduced from 2 (in the prior example) to 1 (below). As a result, the exclude.lack.contrast entry for the awareness variable will be flipped from FALSE to TRUE, and awareness will be removed from the formula.

formula.consideration.2 <-
  create.formula(
    outcome.name = consideration.name,
    input.names = c(age.group.name, gender.name, awareness.name),
    dat = snack.dat,
    reduce = TRUE,
    max.outcome.categories.to.search = 2
  )

print(formula.consideration.2)
#> $formula
#> Consideration ~ `Age Group` + Gender
#> <environment: 0x000000002013d7c8>
#> 
#> $inclusion.table
#>     variable exclude.null.quantity   class order specified.from
#> 1: Age Group                 FALSE  factor     1    input.names
#> 2:    Gender                 FALSE  factor     2    input.names
#> 3: Awareness                 FALSE integer     3    input.names
#>    exclude.user.specified exclude.matches.outcome.name min.categories
#> 1:                  FALSE                        FALSE              4
#> 2:                  FALSE                        FALSE              2
#> 3:                  FALSE                        FALSE              1
#>    exclude.lack.contrast exclude.numerous.categories include.variable
#> 1:                 FALSE                       FALSE             TRUE
#> 2:                 FALSE                       FALSE             TRUE
#> 3:                  TRUE                       FALSE            FALSE
#> 
#> $interactions.table
#> Empty data.table (0 rows and 2 cols): interactions,include.interaction
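
The subset search described above can be sketched in base R on a small, made-up data frame. This is only an illustration of the idea, not the package's internal code; the toy data and the helper names n.unique and min.categories are assumptions introduced here.

```r
# Toy data (hypothetical): consideration is only recorded when awareness == 1.
toy <- data.frame(
  awareness     = c(1, 1, 1, 1, 0, 0),
  consideration = c(1, 0, 1, 0, NA, NA),
  gender        = c("F", "M", "M", "F", "F", "M")
)

# Number of distinct non-missing values of x.
n.unique <- function(x) length(unique(x[!is.na(x)]))

# Smallest number of distinct values of an input, either globally or across
# the subsets defined by each observed value of the outcome.
min.categories <- function(dat, input, outcome, search.subsets) {
  global <- n.unique(dat[[input]])
  if (!search.subsets) return(global)
  outcome.values <- unique(dat[[outcome]][!is.na(dat[[outcome]])])
  within <- vapply(
    outcome.values,
    function(v) n.unique(dat[[input]][which(dat[[outcome]] == v)]),
    integer(1)
  )
  min(global, within)
}

# Plays the role of max.outcome.categories.to.search in this sketch.
max.search <- 2
search <- n.unique(toy$consideration) <= max.search

global.awareness <- min.categories(toy, "awareness", "consideration", FALSE)
subset.awareness <- min.categories(toy, "awareness", "consideration", search)
subset.gender    <- min.categories(toy, "gender", "consideration", search)
```

In this toy data, awareness varies globally but is constant within each observed value of consideration, so only the subset search would flag it. Gender varies within both subsets and would be kept.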

A Large Volume of Levels in a Categorical Variable

As previously discussed, a statistical model that incorporates categorical variables with k > 1 unique values will code k-1 separate columns of indicator variables. Variables displaying user-generated text or unique identifiers may have unique values in all or nearly all of the rows of the data set. Large values of k in a single variable can create computational burdens or lead to intractable structures. Models with such a large number of additional columns may run nearly interminably without any indication of the underlying issue or an estimate of the time to completion.
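
The k-1 coding can be checked with base R's model.matrix on a small, made-up data frame. The toy data and column names below are illustrative and are not taken from snack.dat.

```r
# Toy data (hypothetical): a factor with 4 levels and an identifier that is
# unique in every row.
toy <- data.frame(
  y      = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8),
  region = factor(c("North", "South", "East", "West", "North", "East")),
  id     = factor(sprintf("user_%d", 1:6))
)

k.region <- nlevels(toy$region)
k.id     <- nlevels(toy$id)

# With the default treatment contrasts, each factor contributes k - 1
# indicator columns in addition to the intercept.
cols.region <- ncol(model.matrix(y ~ region, data = toy)) - 1
cols.id     <- ncol(model.matrix(y ~ id, data = toy)) - 1
```

The identifier alone adds almost as many columns as there are rows, which is the situation the next paragraph guards against.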

To avoid this issue, formulaic’s create.formula function allows the user to specify the max.input.categories. Each categorical variable’s number of levels k is computed at a global level. Any such variable with a value of k greater than max.input.categories is automatically excluded from consideration. This shows up in the calculation of the min.categories value and subsequently the exclude.numerous.categories of the inclusion.table.

As an example, the snack.dat’s User ID variable is a character vector that indicates which of the 1000 respondents supplied the answers for the given row. Including the User ID in a model would therefore generate 999 columns of indicator variables. When reduce = TRUE, a data set is supplied, and max.input.categories is set at a value below 1000, then the User ID would be automatically excluded from the formula:

create.formula(
  outcome.name = satisfaction.name,
  input.names = c(age.name, income.name, region.name, id.name),
  dat = snack.dat,
  reduce = TRUE,
  max.input.categories = 30
)$formula
#> Satisfaction ~ Age + Income + Region
#> <environment: 0x0000000021337130>
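
As a rough sketch of the rule described above, the number of levels of each categorical variable can be counted and compared with a threshold. The toy data are made up, and the code only mimics the behavior shown in the output; it is not the package's implementation.

```r
# Toy data (hypothetical): 50 rows with a unique identifier per row.
toy <- data.frame(
  id     = sprintf("user_%03d", 1:50),
  region = rep(c("North", "South", "East", "West", "Central"), times = 10),
  age    = 21:70,
  stringsAsFactors = FALSE
)

# Stands in for the max.input.categories parameter.
max.input.categories <- 30

# Only character and factor columns are treated as categorical here.
is.categorical <- vapply(
  toy,
  function(x) is.character(x) || is.factor(x),
  logical(1)
)

# Number of levels k for each categorical column, computed globally.
k <- vapply(toy[is.categorical], function(x) length(unique(x)), integer(1))

too.many <- names(k)[k > max.input.categories]
kept     <- setdiff(names(toy), too.many)
```

Here id has 50 levels and is dropped, while region (5 levels) and the numeric age column remain.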

Inspection of All Variables

When reduce = TRUE and a data set is supplied, the create.formula function provides a range of quality checks and information about the merits of including specific variables as possible inputs in a model. From the list of all of the variables, a user can quickly identify a reduced list for potential inclusion. As an example, we use the snack.dat to show which of the overall variables would be retained in a model of income:

create.formula(
  outcome.name = income.name,
  input.names = ".",
  reduce = TRUE,
  dat = snack.dat,
  max.input.categories = 30
)$formula
#> Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion
#> Income ~ Age + Gender + Region + Persona + Product + Awareness + 
#>     BP_For_Me_0_10 + BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + 
#>     BP_Good_To_Share_0_10 + BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + 
#>     BP_Everyday_Snack_0_10 + BP_Healthy_0_10 + BP_Delicious_0_10 + 
#>     BP_Right_Amount_0_10 + BP_Relaxing_0_10 + Consideration + 
#>     Consumption + Satisfaction + Advocacy + `Age Group` + `Income Group`
#> <environment: 0x000000001f26f268>

In the formula above, the User ID was removed due to its large number of categories, while the other respondent-specific variables, the names of the products, the brand perceptions, and the states of engagement were retained. The brand perceptions and states of engagement are structurally missing when the respondents were not aware of the product, but that lack of contrast only appears within subsets of the outcome. Income has far more unique values than max.outcome.categories.to.search, so those subsets were not searched here. From this list, an investigator could then make selections of which variables to include (e.g. Age or Age Group). However, much of the preliminary investigation would be handled automatically. This is especially helpful in settings in which the full relationship of the variables, such as the sequence and dependencies of the marketing survey’s questions, is not yet fully understood.

Transformation

Rather than looking only for the names of existing variables, the updated create.formula evaluates the inputs and outcomes against the data to see whether they produce valid results. With this change, users can include transformed inputs such as sqrt(Age^2 * log(Income)). Likewise, the outcome can be a transformation such as log(Income). In the example below, the misspecified input ldkao cannot be evaluated and is left out of the formula.


create.formula(
  outcome.name = income.name,
  input.names = c(region.name, gender.name, sprintf("sqrt(%s^2) * log(%s)", age.name, income.name), income.group.name, "ldkao"),
  reduce = TRUE,
  interactions = list(c(gender.name, income.group.name)),
  input.patterns = NULL,
  force.main.effects = TRUE,
  max.input.categories = 20,
  max.outcome.categories.to.search = 4,
  order.as = "as.specified",
  include.backtick = "as.needed",
  format.as = "formula",
  variables.to.exclude = NULL,
  include.intercept = TRUE,
  dat = snack.dat
)$formula
#> Income ~ Region + Gender + sqrt(Age^2) * log(Income) + `Income Group` + 
#>     Gender * `Income Group`
#> <environment: 0x0000000020048a10>
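
One way to picture this kind of check is to parse each candidate term and try to evaluate it inside the data set. The sketch below is an assumption about how such a test could be written in base R, not a description of the package's internals. The helper name evaluates.in.data and the toy data are hypothetical, and names that need backticks (such as Age Group) are not handled.

```r
# Toy data (hypothetical).
toy <- data.frame(Age = c(25, 40, 61), Income = c(42000, 58000, 91000))

# TRUE if the text parses and evaluates inside dat to one value per row.
evaluates.in.data <- function(term, dat) {
  result <- tryCatch(
    eval(parse(text = term), envir = dat, enclos = baseenv()),
    error = function(e) NULL
  )
  !is.null(result) && length(result) == nrow(dat)
}

candidates <- c("Age", "sqrt(Age^2) * log(Income)", "log(Income)", "ldkao")
usable <- vapply(candidates, evaluates.in.data, logical(1), dat = toy)

usable.terms <- candidates[usable]
```

The three terms built from Age and Income evaluate successfully, while ldkao raises an error and is filtered out.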


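The result of create.formula can be passed directly to a modeling function such as glm. In the example below, the inputs and interactions include misspecified names (Typo, Inco, and Reg ion), which are excluded before the model is fit.

# The definitions of input.names and interactions are not shown in the original;
# they are reconstructed here (an assumption) from the tables printed below.
input.names <- c(age.name, "Typo")
interactions <- list(c(age.name, gender.name), c(age.name, income.name), c(age.name, gender.name, income.name), c(gender.name, "Inco"), c(age.name, "Reg ion"))

res <- create.formula(outcome.name = awareness.name, input.names = input.names, interactions = interactions, dat = snack.dat, reduce = TRUE)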


glm(formula = res$formula, data = snack.dat, family = "binomial")
#> 
#> Call:  glm(formula = res$formula, family = "binomial", data = snack.dat)
#> 
#> Coefficients:
#>           (Intercept)                    Age             GenderMale  
#>             2.166e-01             -1.879e-03             -1.786e-01  
#>                Income         Age:GenderMale             Age:Income  
#>            -4.874e-07              2.512e-03              1.633e-08  
#>     GenderMale:Income  Age:GenderMale:Income  
#>             9.725e-07             -2.824e-08  
#> 
#> Degrees of Freedom: 22999 Total (i.e. Null);  22992 Residual
#> Null Deviance:       31820 
#> Residual Deviance: 31810     AIC: 31830

res
#> $formula
#> Awareness ~ Age + Gender + Income + Age * Gender + Age * Income + 
#>     Age * Gender * Income
#> <environment: 0x0000000020010ee0>
#> 
#> $inclusion.table
#>    variable exclude.null.quantity   class order specified.from
#> 1:      Age                 FALSE integer     1    input.names
#> 2:     Typo                  TRUE    <NA>     2    input.names
#> 3:   Gender                 FALSE  factor     3   interactions
#> 4:   Income                 FALSE numeric     4   interactions
#> 5:     Inco                  TRUE    <NA>     5   interactions
#> 6:  Reg ion                  TRUE    <NA>     6   interactions
#>    exclude.user.specified exclude.matches.outcome.name min.categories
#> 1:                  FALSE                        FALSE             75
#> 2:                     NA                        FALSE             NA
#> 3:                  FALSE                        FALSE              2
#> 4:                  FALSE                        FALSE            138
#> 5:                     NA                        FALSE             NA
#> 6:                     NA                        FALSE             NA
#>    exclude.lack.contrast exclude.numerous.categories include.variable
#> 1:                 FALSE                       FALSE             TRUE
#> 2:                    NA                          NA            FALSE
#> 3:                 FALSE                       FALSE             TRUE
#> 4:                 FALSE                       FALSE             TRUE
#> 5:                    NA                          NA            FALSE
#> 6:                    NA                          NA            FALSE
#> 
#> $interactions.table
#>             interactions include.interaction
#> 1:          Age * Gender                TRUE
#> 2:          Age * Income                TRUE
#> 3: Age * Gender * Income                TRUE
#> 4:         Gender * Inco               FALSE
#> 5:         Age * Reg ion               FALSE
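
The pattern in these tables, where a misspelled name removes both the variable and every interaction that uses it, can be mimicked in a few lines of base R. This is a simplified sketch on made-up data. It only checks names against the columns of the data set and is not the package's own logic.

```r
# Toy data (hypothetical).
toy <- data.frame(
  Age    = c(30, 45),
  Gender = c("F", "M"),
  Income = c(50000, 72000)
)

input.names  <- c("Age", "Typo")
interactions <- list(c("Age", "Gender"), c("Gender", "Inco"), c("Age", "Reg ion"))

# Every variable requested, either directly or through an interaction.
requested <- unique(c(input.names, unlist(interactions)))

# A variable is included only if it exists in the data.
inclusion <- data.frame(
  variable         = requested,
  include.variable = requested %in% names(toy)
)

# An interaction is included only if all of its components exist.
interactions.table <- data.frame(
  interactions        = vapply(interactions, paste, character(1), collapse = " * "),
  include.interaction = vapply(interactions, function(v) all(v %in% names(toy)), logical(1))
)
```

Only Age, Gender, and the Age * Gender interaction survive in this toy case.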

Reducing an Existing Formula (reduce.existing.formula):

The quality checks and automatic removal of impractical variables can also be accessed when a formula has already been constructed, using the reduce.existing.formula function. This method uses natural language processing techniques to deconstruct the components of a formula. Each variable and interaction is separately identified and aggregated. These variables are then supplied to create.formula as the input.names and interactions parameters. Otherwise, the parameters of reduce.existing.formula are designed to match those of create.formula. As a result, an initial formula can be evaluated in terms of the same set of quality checks, and the formula can be reduced based on the same set of exclusions.
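
To give a feel for what deconstructing a formula involves, the sketch below uses base R's terms and all.vars to separate an outcome, the main effects, and the interactions. It swaps in base R tooling for illustration and is not the method used by reduce.existing.formula. It also assumes an outcome that is a plain variable name and a formula without the "." shorthand.

```r
f <- Awareness ~ Age + Gender + Income + Age * Gender + Age * Gender * Income

tt     <- terms(f)
labels <- attr(tt, "term.labels")  # e.g. "Age", "Age:Gender", ...
orders <- attr(tt, "order")        # 1 for main effects, 2+ for interactions

# The response index points into the variables of the formula.
outcome <- all.vars(f)[attr(tt, "response")]

# Main effects, and each interaction split into its component variables.
main.effects <- labels[orders == 1]
interactions <- strsplit(labels[orders > 1], ":", fixed = TRUE)
```

Note that terms expands Age * Gender * Income into every lower-order term, so this sketch lists more interactions than were written explicitly in the formula.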

Parameter description:

the.initial.formula The existing formula to be reduced. In the example below, it is supplied as a character value ('Income ~ .').
dat Data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model.
max.input.categories The maximum number of unique categories that a categorical input may have and still be included in the formula. The default is 20, and users can change it as needed.
max.outcome.categories.to.search The maximum number of unique outcome values for which the subsets of the outcome will be searched for a lack of contrast. The default is 4, and users can change it as needed.
order.as Specifies the order of the variables in the resulting formula (e.g. "as.specified", as used in the Transformation example).
include.backtick Specifies whether backticks are added to variable names (e.g. "as.needed", as used in the Transformation example, where only `Income Group` receives them).
format.as The data type of the output. If not set as “formula”, then a character vector will be returned.

As an example, we will demonstrate that a user-supplied formula produces the same result as the formula created in the Inspection of All Variables section:

the.initial.formula <- 'Income ~ .'

reduce.existing.formula(
  the.initial.formula = the.initial.formula,
  dat = snack.dat,
  max.input.categories = 30
)$formula
#> Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion
#> Income ~ Age + Gender + Region + Persona + Product + Awareness + 
#>     BP_For_Me_0_10 + BP_Fits_Budget_0_10 + BP_Tastes_Great_0_10 + 
#>     BP_Good_To_Share_0_10 + BP_Like_Logo_0_10 + BP_Special_Occasions_0_10 + 
#>     BP_Everyday_Snack_0_10 + BP_Healthy_0_10 + BP_Delicious_0_10 + 
#>     BP_Right_Amount_0_10 + BP_Relaxing_0_10 + Consideration + 
#>     Consumption + Satisfaction + Advocacy + `Age Group` + `Income Group`
#> <environment: 0x00000000169c8e48>

Introduction to Formulaic

Authors: David Shilane, Caffrey Lee, Zoe Huang, Anderson Nelson

2021-02-15
