Introduction to archetyper

2021-03-15

library(archetyper)

‘archetyper’ initializes data mining and data science projects by generating common workflow components, as well as the peripheral files needed to support technical best practices.

The lifecycle of a data mining project generally includes the following components: testing, data integration, enrichment (feature engineering), modeling, evaluation, and presentation.

Additionally, a well-formed data mining project will include supporting artifacts such as a readme, a .gitignore, utility functions, and linting scripts.

Generating a new project with generate()

generate() will create a new project with the files and directories needed to support the data mining and data science workflow.

generate("majestic_12")

list.files("majestic_12")
[1] "data_input"        "data_output"       "data_working"      "docs"              "majestic_12.Rproj"
[6] "models"            "R"                 "readme.md"      ".gitignore"        

The R code for the data workflow will be in the R/ directory.

list.files("majestic_12/R/")

 [1] "0_test.R"      "1_integrate.R" "2_enrich.R"    "3_model.R"    
 [5] "4_evaluate.R"  "5_present.Rmd" "api.R"         "common.R"     
 [9] "explore.R"     "lint.R"        "mediator.R"    "utilities.R"  

The base workflow files include integrate.R, enrich.R, model.R, evaluate.R, present.Rmd, and api.R.

Additional files are created to serve supporting functions: 0_test.R (tests, run at the start of the workflow), common.R (shared setup sourced by the workflow scripts), explore.R (ad hoc data exploration), lint.R (linting), mediator.R (workflow orchestration), and utilities.R (utility functions).

A directory structure designed to logically separate the data artifacts produced throughout the workflow is also generated by the ‘archetyper’ package. These directories include data_input/, data_working/, data_output/, models/, and docs/.

For traceability, files and objects (e.g. models) throughout the project are named according to a standard naming convention:

[ project_name ]_[ file_name ]_[ YYYY_MM_DD_HH:MM ].[ file_extension ]

This structure, in conjunction with the persistent state of each component, allows each component script to be run independently without sourcing all the preceding components.
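
As an illustration, a name following this convention could be assembled as below (a minimal sketch; build_file_name() is a hypothetical helper, not an ‘archetyper’ function):

# A hypothetical sketch of assembling a file name per the convention above.
build_file_name <- function(project_name, file_name, extension) {
  timestamp <- format(Sys.time(), "%Y_%m_%d_%H:%M")
  paste0(paste(project_name, file_name, timestamp, sep = "_"), ".", extension)
}
build_file_name("majestic_12", "integrated", "feather")
#> e.g. "majestic_12_integrated_2021_03_15_14:05.feather"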

Database connections

A database connection type of “odbc” or “jdbc” can be passed to generate() via the db_connection_type argument to create scaffolding helpful for database connections.

ODBC

A db_connection_type argument of ‘odbc’ will generate a connection code snippet in the integrate.R file.

library(odbc)
con <- dbConnect(odbc::odbc(), "dev_database")
sql <- "select my_value from my_table"
result_df <- dbGetQuery(con, sql)
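
Assuming the db_connection_type argument shown in the JDBC example further below, an ODBC-ready project could be generated with:

generate("majestic_12", db_connection_type = 'odbc')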

A file to store database DML and DDL (for data preparation occurring in the database prior to the integration step), dml_ddl.sql, is additionally generated when using ‘odbc’.
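
For example, dml_ddl.sql might hold preparation statements along these lines (a hypothetical sketch; the table and column names mirror the ODBC snippet above):

-- hypothetical preparation SQL; table and column names mirror the ODBC snippet
create table my_table (my_value varchar(50));
insert into my_table (my_value) values ('example');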

Note that when using ODBC, the user must update the appropriate ODBC configuration files (e.g. odbcinst.ini, odbc.ini).
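
For instance, the "dev_database" data source name used in the snippet above would need a corresponding odbc.ini entry similar to the following (driver and connection details are placeholder assumptions):

; odbc.ini -- hypothetical entry; all values are placeholders
[dev_database]
Driver   = PostgreSQL
Server   = localhost
Port     = 5432
Database = dev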

JDBC

The following files and directories will be created with a ‘jdbc’ argument: config.yml, the drivers/ directory, and dml_ddl.sql.

A db_connection_type argument of ‘jdbc’ will additionally generate a connection code snippet in the integrate.R file. Note that the credentials are sourced from the config.yml file so that they are not exposed in the source code. config.yml is also listed in the generated .gitignore file so that it is not committed to source control.

library(RJDBC)
db_credentials <- config::get("dev_database")
drv <- RJDBC::JDBC(driverClass = db_credentials$driver_class, classPath = Sys.glob("drivers/*"))
con <- dbConnect(drv, db_credentials$connection_string, db_credentials$username, db_credentials$password)
sql <- "select my_value from my_table"
result_df <- dbGetQuery(con, sql)

When using a JDBC connection, the user must provide appropriate driver JARs in the drivers/ directory, as well as user credentials, class path, and connection string in the config.yml file.
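
A config.yml consumed by the snippet above might look like the following (a hypothetical example; all values are placeholders, and the keys mirror those read via config::get("dev_database")):

# config.yml -- hypothetical values; keys match the snippet above
default:
  dev_database:
    driver_class: "org.postgresql.Driver"
    connection_string: "jdbc:postgresql://localhost:5432/dev"
    username: "my_user"
    password: "my_password"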

generate("majestic_12", db_connection_type = 'jdbc')
list.files(project_path)

 [1] "config.yml"        "data_input"        "data_output"      
 [4] "data_working"      "docs"              "drivers"          
 [7] "majestic_12.Rproj" "models"            "R"                
[10] "readme.md"         ".gitignore"        "dml_ddl.sql"

Excluding components

The exclude argument prevents specified files from being generated.

generate(project_name = "majestic_12", exclude = c("api.R", "utilities.R", "readme.md", "lint.R", ".gitignore"))

list.files("majestic_12")
[1] "data_input"        "data_output"       "data_working"      "docs"              "majestic_12.Rproj"
[6] "models"            "R"
list.files("majestic_12/R/")
 [1] "0_test.R"      "1_integrate.R" "2_enrich.R"    "3_model.R"     "4_evaluate.R"  "5_present.Rmd"
 [7] "common.R"      "explore.R"     "mediator.R"

Generating the demo project

The ‘archetyper’ package is pre-packaged with a working demo project that predicts hospital readmission rates based on publicly-available structural characteristics and complication rates.

The demo project can be generated by running the generate_demo() function.

archetyper::generate_demo()
list.files("hospital_readmissions_demo/")
[1] "data_input"        "data_output"       "data_working"      "docs"              "hospital_readmissions_demo.Rproj"
[6] "models"            "R"                 "readme.md"        ".gitignore"

Once the demo project has been created, the project should be opened in RStudio.

Running the demo workflow

The full data-mining/data-science lifecycle can be triggered by executing the mediator.R file. The contents of the mediator.R file are below:

cat(readChar("hospital_readmissions_demo/R/mediator.R"), 1e5))

##--------------------------------------------------------------------------
##  The mediator file will execute the linear data processing work-flow.   -
##--------------------------------------------------------------------------

source("R/common.R")
tryCatch({
    info(logger, "running tests...")
    source("R/0_test.R")
    info(logger, "gathering and integrating data...")
    source("R/1_integrate.R")
    info(logger, "enriching base data...")
    source("R/2_enrich.R")
    info(logger, "building model(s)...")
    source("R/3_model.R")
    info(logger, "applying model(s) to test partitions...")
    source("R/4_evaluate.R")
    info(logger, "building presentation materials...")
    rmarkdown::render("R/5_present.Rmd", "pdf_document", output_dir = "docs")
    info(logger, "workflow is complete.")

  },
  error = function(cond) {
    log4r::error(logger, str_c("Script error: ", cond))
  }
)

Note that the file includes a centralized logger to distinguish levels of severity (using the ‘log4r’ package) as well as relative directory references (using the ‘here’ package). Comments in all files were created using the ‘bannerCommenter’ package.
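
As a rough sketch (the actual setup lives in common.R and may differ), the centralized logger and path handling could be configured like this:

# A hypothetical sketch of the setup that common.R provides.
library(log4r)
library(here)

logger <- log4r::logger(threshold = "INFO")
log4r::info(logger, "logger configured")

# 'here' resolves paths relative to the project root, regardless of working directory
working_dir <- here::here("data_working")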

Integration

A set of publicly-available data files will be loaded by the 1_integrate.R file. The integration step joins and transforms the source files according to Tidy Data principles, and persists the integrated data into the data_working/ directory.

> list.files("data_working/")
[1] "hospital_readmissions_integrated_2021-02-25.feather"    

Enrichment

The enrichment step creates features better suited for the modeling process by applying feature engineering methods (such as scaling and centering the numeric features), outlier removal (using Cook’s distance), imputation (using predictive mean matching, or PMM), and feature selection (by removing highly-correlated features). Additionally, training and testing partition labels are assigned through stratified random sampling. The enriched results are stored in feather format in the data_working/ directory.

> list.files("data_working/")
[1] "hospital_readmissions_enriched_2021-02-25.feather"       "hospital_readmissions_integrated_2021-02-25.feather" 

Modeling

In the modeling step, a stepwise linear regression is applied to the training partition of the enriched dataset. The model coefficients and performance statistics are stored in the data_output/ directory, and the model itself in the models/ directory.

> list.files("data_ouput/")
[1] "hospital_readmissions_feature_dtl_2021-02-25.csv"   "hospital_readmissions_perf_2021-02-25.csv"

> list.files("mod/")
[1] "hospital_readmissions_readmissions_2021-02-23.mod"

Evaluation

In the evaluation step, the trained model from the models/ directory is applied to the testing partition of the enriched dataset. The testing data with the appended predictions, along with performance statistics from the testing dataset, are stored in the data_output/ directory as .csv files.

> list.files("data_ouput/")
[1] "hospital_readmissions_feature_dtl_2021-02-25.csv"   "hospital_readmissions_holdout_perf_stats_2021-02-25.csv"
[3] "hospital_readmissions_perf_2021-02-25.csv"    "hospital_readmissions_testing_w_predictions_2021-02-25.csv"

Presentation

Finally, an R Markdown report is produced, using files sourced from the data_output/ directory.

> list.files("hospital_readmissions_demo/docs/")
[1] "5_present.pdf"

Deployment

The api.R file generates a sample RESTful API that uses the trained model from the models/ directory. The demo API can be called with the sample request body below:

{
  "dt": {"state": "AL",
    "hospital_type": "acute_care_hospitals",
    "hospital_ownership": "government_hospital_district_or_authority",
    "emergency_services": "Yes",
    "ehr_interop": "Y",
    "denominator": 1.4791,
    "PSI_10": -1.3895,
    "PSI_11": 0.597,
    "PSI_12": -0.9487,
    "PSI_13": -0.102,
    "PSI_14": -1.1131,
    "PSI_15": -0.9439,
    "PSI_3": 0.029,
    "PSI_6": -0.5752,
    "PSI_8": -1.4081,
    "PSI_9": -0.3028,
    "denominator_ln": 1.3489}
}
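
For reference, a minimal ‘plumber’-style endpoint of the kind api.R might define looks like this (a hedged sketch, not the generated file; the route and handler are assumptions):

library(plumber)

model <- readRDS("models/hospital_readmissions_readmissions_2021-02-23.mod")

#* Predict a readmission rate for the posted hospital record
#* @post /predict
function(dt) {
  predict(model, newdata = as.data.frame(dt))
}

The file can then be served locally via plumber::pr_run(plumber::pr("R/api.R"), port = 8000).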

Note that the demo project was designed simply to illustrate the functionality of the ‘archetyper’ package. It was not designed to be a production-ready or publication-worthy model.