In this example we will showcase the pca_your_recipe() function. This function takes only a few arguments. The arguments are currently .data, which is the full data set that gets passed internally to the recipes::bake() function; .recipe_object, which is a recipe you have already made and want to pass to the function in order to perform the PCA; and finally .threshold, which is the fraction of the variance that should be captured by the components.

Libraries

library(timetk)
library(dplyr)
library(purrr)
library(healthyR.data)
library(rsample)
library(recipes)
library(ggplot2)
library(plotly)

Data

Now that we have our libraries loaded, we can go ahead and get our data set ready.

Data Set

data_tbl <- healthyR_data %>%
    select(visit_end_date_time) %>%
    summarise_by_time(
        .date_var = visit_end_date_time,
        .by       = "month",
        value     = n()
    ) %>%
    set_names("date_col","value") %>%
    filter_by_time(
        .date_var = date_col,
        .start_date = "2013",
        .end_date = "2020"
    ) %>%
    mutate(date_col = as.Date(date_col))

head(data_tbl)
#> # A tibble: 6 × 2
#>   date_col   value
#>   <date>     <int>
#> 1 2013-01-01  2082
#> 2 2013-02-01  1719
#> 3 2013-03-01  1796
#> 4 2013-04-01  1865
#> 5 2013-05-01  2028
#> 6 2013-06-01  1813

The data set is simple and by itself would not be useful for PCA, since the only predictor is time. To facilitate the use of the function in this example, we will create a splits object and a recipe object.

Splits

splits <- initial_split(data = data_tbl, prop = 0.8)

splits
#> <Training/Testing/Total>
#> <76/19/95>

head(training(splits))
#> # A tibble: 6 × 2
#>   date_col   value
#>   <date>     <int>
#> 1 2019-10-01  1525
#> 2 2014-12-01  1757
#> 3 2017-08-01  1607
#> 4 2019-07-01  1474
#> 5 2020-07-01  1056
#> 6 2013-03-01  1796
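
The split is random rather than time-ordered, which is why the first training rows jump between years. As a rough illustration only, and assuming a plain random draw of floor(prop * n) rows (a base R sketch, not the rsample implementation), the 76/19 sizes can be reproduced like this:

```r
# Base R sketch of a random 80/20 split; not how rsample is implemented.
set.seed(123)   # arbitrary seed, for reproducibility only
n    <- 95      # number of monthly rows in data_tbl
prop <- 0.8

# Draw the training row indices at random; the rest become the test set.
train_idx <- sample(seq_len(n), size = floor(prop * n))
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)
length(test_idx)
```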

Initial Recipe

rec_obj <- recipe(value ~ ., training(splits)) %>%
    step_timeseries_signature(date_col) %>%
    step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)"))

rec_obj
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 1
#> 
#> ── Operations
#> • Timeseries signature features from: date_col
#> • Variables removed: matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")

get_juiced_data(rec_obj) %>% glimpse()
#> Rows: 76
#> Columns: 20
#> $ date_col           <date> 2019-10-01, 2014-12-01, 2017-08-01, 2019-07-01, 20…
#> $ value              <int> 1525, 1757, 1607, 1474, 1056, 1796, 1628, 1651, 130…
#> $ date_col_index.num <dbl> 1569888000, 1417392000, 1501545600, 1561939200, 159…
#> $ date_col_year      <int> 2019, 2014, 2017, 2019, 2020, 2013, 2015, 2018, 201…
#> $ date_col_half      <int> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, …
#> $ date_col_quarter   <int> 4, 4, 3, 3, 3, 1, 1, 3, 4, 3, 4, 2, 2, 2, 4, 3, 4, …
#> $ date_col_month     <int> 10, 12, 8, 7, 7, 3, 3, 7, 11, 7, 10, 5, 5, 4, 12, 9…
#> $ date_col_month.lbl <ord> October, December, August, July, July, March, March…
#> $ date_col_day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ date_col_wday      <int> 3, 2, 3, 2, 4, 6, 1, 1, 6, 3, 7, 4, 6, 2, 7, 2, 5, …
#> $ date_col_wday.lbl  <ord> Tuesday, Monday, Tuesday, Monday, Wednesday, Friday…
#> $ date_col_mday      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ date_col_qday      <int> 1, 62, 32, 1, 1, 60, 60, 1, 32, 1, 1, 31, 31, 1, 62…
#> $ date_col_yday      <int> 274, 335, 213, 182, 183, 60, 60, 182, 305, 182, 275…
#> $ date_col_mweek     <int> 5, 6, 6, 6, 5, 5, 4, 5, 5, 5, 5, 5, 5, 6, 5, 6, 5, …
#> $ date_col_week      <int> 40, 48, 31, 26, 27, 9, 9, 26, 44, 26, 40, 18, 18, 1…
#> $ date_col_week2     <int> 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, …
#> $ date_col_week3     <int> 1, 0, 1, 2, 0, 0, 0, 2, 2, 2, 1, 0, 0, 1, 0, 2, 1, …
#> $ date_col_week4     <int> 0, 0, 3, 2, 3, 1, 1, 2, 0, 2, 0, 2, 2, 1, 0, 3, 0, …
#> $ date_col_mday7     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
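
To give a feel for what the signature step adds, here is a minimal base R sketch that derives a few of the same kinds of calendar features from a date vector. The column names mimic the output above, but the code is illustrative only and is not how timetk computes them.

```r
# Illustrative only: derive a few calendar features from dates in base R.
dates <- as.Date(c("2013-01-01", "2013-02-01", "2013-03-01", "2019-10-01"))

month_num <- as.integer(format(dates, "%m"))

signature_sketch <- data.frame(
  date_col         = dates,
  date_col_year    = as.integer(format(dates, "%Y")),
  date_col_half    = ifelse(month_num <= 6, 1L, 2L),
  date_col_quarter = (month_num - 1L) %/% 3L + 1L,
  date_col_month   = month_num,
  date_col_mday    = as.integer(format(dates, "%d")),
  date_col_yday    = as.integer(format(dates, "%j"))
)

signature_sketch
```

Because every date falls on the first of the month, the day-of-month column is constant, which matters for the scaling step later on.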

Now that we have our initial recipe, we can use the pca_your_recipe() function.

pca_list <- pca_your_recipe(
  .recipe_object = rec_obj,
  .data          = data_tbl,
  .threshold     = 0.8,
  .top_n         = 5
)
#> Warning: !  The following columns have zero variance so scaling cannot be used:
#>   date_col_day, date_col_mday, and date_col_mday7.
#> ℹ Consider using ?step_zv (`?recipes::step_zv()`) to remove those columns
#>   before normalizing.
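
The warning appears because every observation falls on the first of the month, so the day-of-month columns are constant. The base R sketch below shows the underlying problem: scaling a constant column divides by a standard deviation of zero. The data frame and column names are made up for illustration; inside a recipe, the warning's own suggestion of recipes::step_zv() is the usual remedy.

```r
# Hypothetical toy data: `mday` is constant, like date_col_mday above.
toy <- data.frame(
  month = c(1, 2, 3, 4),
  mday  = c(1, 1, 1, 1)
)

# Scaling the constant column yields NaN because its standard deviation is 0.
scaled <- scale(toy)
scaled[, "mday"]

# Identify and drop zero-variance columns before scaling.
zero_var  <- vapply(toy, function(x) var(x) == 0, logical(1))
toy_clean <- toy[, !zero_var, drop = FALSE]
scale(toy_clean)
```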

Inspect PCA Output

The function returns a list object invisibly, so you must assign the output to a variable. You can then access the items of the list in the usual manner.

The following items are included in the output of the function:

pca_transform - This is the PCA recipe.
variable_loadings
variable_variance
pca_estimates
pca_juiced_estimates
pca_baked_data
pca_variance_df
pca_variance_scree_plt
pca_rotation_df
pca_loadings_plt
pca_top_n_loadings_plt
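
As a reminder of what an invisible return means in practice, here is a small sketch with a hypothetical function (make_results() is invented for illustration and is not part of healthyR.ai). Calling it on its own prints nothing; assigning the result makes the list items available by name.

```r
# Hypothetical function that returns a named list invisibly.
make_results <- function() {
  out <- list(estimate = 1.5, label = "demo")
  invisible(out)
}

make_results()          # nothing is printed at the console
res <- make_results()   # assign the output to keep it

names(res)
res$estimate
```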

Let's start going down the list of items.

PCA Transform

This is the item you will want to save to a variable, since it is the recipe object itself that you will use further along in your work.

pca_rec_obj <- pca_list$pca_transform

pca_rec_obj
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 1
#> 
#> ── Operations
#> • Timeseries signature features from: date_col
#> • Variables removed: matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")
#> • Centering for: recipes::all_numeric()
#> • Scaling for: recipes::all_numeric()
#> • Sparse, unbalanced variable filter on: recipes::all_numeric()
#> • PCA extraction with: recipes::all_numeric_predictors()

Variable Loadings

pca_list$variable_loadings
#> # A tibble: 169 × 4
#>    terms                 value component id       
#>    <chr>                 <dbl> <chr>     <chr>    
#>  1 date_col_index.num -0.0665  PC1       pca_ZHMem
#>  2 date_col_year      -0.0124  PC1       pca_ZHMem
#>  3 date_col_half      -0.386   PC1       pca_ZHMem
#>  4 date_col_quarter   -0.433   PC1       pca_ZHMem
#>  5 date_col_month     -0.436   PC1       pca_ZHMem
#>  6 date_col_wday       0.0247  PC1       pca_ZHMem
#>  7 date_col_qday      -0.0497  PC1       pca_ZHMem
#>  8 date_col_yday      -0.436   PC1       pca_ZHMem
#>  9 date_col_mweek     -0.00376 PC1       pca_ZHMem
#> 10 date_col_week      -0.436   PC1       pca_ZHMem
#> # ℹ 159 more rows

Variable Variance

pca_list$variable_variance
#> # A tibble: 52 × 4
#>    terms       value component id       
#>    <chr>       <dbl>     <int> <chr>    
#>  1 variance 5.17             1 pca_ZHMem
#>  2 variance 2.03             2 pca_ZHMem
#>  3 variance 1.56             3 pca_ZHMem
#>  4 variance 1.28             4 pca_ZHMem
#>  5 variance 1.21             5 pca_ZHMem
#>  6 variance 0.638            6 pca_ZHMem
#>  7 variance 0.564            7 pca_ZHMem
#>  8 variance 0.488            8 pca_ZHMem
#>  9 variance 0.0581           9 pca_ZHMem
#> 10 variance 0.000227        10 pca_ZHMem
#> # ℹ 42 more rows

PCA Estimates

pca_list$pca_estimates
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 1
#> 
#> ── Training information
#> Training data contained 76 data points and no incomplete rows.
#> 
#> ── Operations
#> • Timeseries signature features from: date_col | Trained
#> • Variables removed: date_col_year.iso date_col_month.xts, ... | Trained
#> • Centering for: value, date_col_index.num, date_col_year, ... | Trained
#> • Scaling for: value, date_col_index.num, date_col_year, ... | Trained
#> • Sparse, unbalanced variable filter removed: date_col_day, ... | Trained
#> • PCA extraction with: date_col_index.num date_col_year, ... | Trained

Juiced and Baked Data

pca_list$pca_juiced_estimates %>% glimpse()
#> Rows: 76
#> Columns: 9
#> $ date_col           <date> 2019-10-01, 2014-12-01, 2017-08-01, 2019-07-01, 20…
#> $ value              <dbl> -0.1102009, 0.7893652, 0.2077492, -0.3079504, -1.92…
#> $ date_col_month.lbl <ord> October, December, August, July, July, March, March…
#> $ date_col_wday.lbl  <ord> Tuesday, Monday, Tuesday, Monday, Wednesday, Friday…
#> $ PC1                <dbl> -2.7458635, -3.4270852, -0.7568537, -1.0440634, -0.…
#> $ PC2                <dbl> -1.565171740, 1.510429529, 0.204199388, -0.87311426…
#> $ PC3                <dbl> -0.0001972329, -0.6150954024, 2.3018101359, 2.76583…
#> $ PC4                <dbl> 1.74777239, -0.32725317, -1.11607242, 1.10893994, -…
#> $ PC5                <dbl> -0.07863831, -2.31569777, -0.53758842, -0.37896107,…

pca_list$pca_baked_data %>% glimpse()
#> Rows: 95
#> Columns: 9
#> $ date_col           <date> 2013-01-01, 2013-02-01, 2013-03-01, 2013-04-01, 20…
#> $ value              <dbl> 2.0495332, 0.6420225, 0.9405853, 1.2081287, 1.84015…
#> $ date_col_month.lbl <ord> January, February, March, April, May, June, July, A…
#> $ date_col_wday.lbl  <ord> Tuesday, Friday, Friday, Monday, Wednesday, Saturda…
#> $ PC1                <dbl> 3.4030132, 2.9907161, 2.6285260, 1.9099228, 1.07026…
#> $ PC2                <dbl> 2.2966037, 1.9992129, 1.5477109, 2.3559201, 1.55679…
#> $ PC3                <dbl> 0.6290209, -0.6557058, -2.1183589, 0.7861542, -1.58…
#> $ PC4                <dbl> 1.2683326, 0.2364059, -1.0048467, 1.1701121, -0.161…
#> $ PC5                <dbl> -0.88429703, 1.47350272, -0.05630354, -1.20416386, …
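
The row counts differ because the juiced estimates come from the 76 training rows used to prep the recipe, while the baked data applies the trained steps to all 95 rows passed in .data. A base R analogue of that idea is sketched below, assuming prcomp() as a stand-in for the recipe's PCA step and the built-in USArrests data as a placeholder.

```r
# Base R analogue of "prep on training data, bake on new data".
set.seed(42)   # arbitrary seed, for reproducibility only
train_rows <- sample(seq_len(nrow(USArrests)), size = 40)
train_data <- USArrests[train_rows, ]

# Centering, scaling, and rotation are all learned from the training rows.
pca_fit <- prcomp(train_data, center = TRUE, scale. = TRUE)

# "Juiced": scores for the rows the PCA was fit on.
juiced_scores <- pca_fit$x

# "Baked": the same trained transformation applied to every row.
baked_scores <- predict(pca_fit, newdata = USArrests)

dim(juiced_scores)
dim(baked_scores)
```

The training rows receive the same scores in both objects; the baked data simply adds scores for the rows that were held out.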

Rotation Data

pca_list$pca_rotation_df %>% glimpse()
#> Rows: 13
#> Columns: 13
#> $ PC1  <dbl> -0.066493201, -0.012379375, -0.386498673, -0.433405887, -0.435723…
#> $ PC2  <dbl> -0.674160359, -0.681686543, 0.075236177, 0.028032622, 0.024532164…
#> $ PC3  <dbl> 0.18192975, 0.18480670, 0.21781743, 0.04525416, -0.01460549, -0.2…
#> $ PC4  <dbl> -0.006537556, 0.005703034, 0.003132411, 0.088662408, -0.098568289…
#> $ PC5  <dbl> -0.007972229, -0.007170887, 0.154342827, 0.031146186, -0.00861421…
#> $ PC6  <dbl> -0.02622017, -0.02367719, -0.26195169, -0.01115824, -0.01983138, …
#> $ PC7  <dbl> -0.03150548, -0.02642184, -0.11599004, -0.03222366, -0.03972107, …
#> $ PC8  <dbl> -0.0610268588, -0.0608669252, 0.2228511579, 0.1218330341, -0.0003…
#> $ PC9  <dbl> -0.01279068, 0.01449010, 0.80120634, -0.27505382, -0.21482529, 0.…
#> $ PC10 <dbl> 0.014209863, -0.012122646, -0.001798115, 0.300512705, 0.373009670…
#> $ PC11 <dbl> -0.0259374659, 0.0265664980, -0.0015883283, -0.0391562630, 0.6626…
#> $ PC12 <dbl> -5.325697e-03, 5.040297e-03, -3.156335e-03, -7.847461e-01, 4.1619…
#> $ PC13 <dbl> -7.081036e-01, 7.034637e-01, 8.774929e-05, 3.414925e-02, 1.310318…
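
The rotation data is the wide form of the variable loadings shown earlier: 13 terms by 13 components gives the 169 rows of the long table. The sketch below shows that relationship using prcomp() on the built-in USArrests data, which is a stand-in and not the function's internals.

```r
pca_fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Wide form: one row per variable, one column per component.
rotation_wide <- as.data.frame(pca_fit$rotation)

# Long form: one row per variable/component pair.
# as.vector() reads the matrix column by column, so terms cycle fastest.
loadings_long <- data.frame(
  terms     = rep(rownames(pca_fit$rotation), times = ncol(pca_fit$rotation)),
  component = rep(colnames(pca_fit$rotation), each = nrow(pca_fit$rotation)),
  value     = as.vector(pca_fit$rotation)
)

dim(rotation_wide)
nrow(loadings_long)
```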

Variance and Scree Plot

pca_list$pca_variance_df %>% glimpse()
#> Rows: 13
#> Columns: 6
#> $ PC              <chr> "PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8"…
#> $ var_explained   <dbl> 3.979029e-01, 1.561310e-01, 1.201252e-01, 9.838223e-02…
#> $ var_pct_txt     <chr> "39.79%", "15.61%", "12.01%", "9.84%", "9.29%", "4.91%…
#> $ cum_var_pct     <dbl> 0.3979029, 0.5540339, 0.6741591, 0.7725413, 0.8654737,…
#> $ cum_var_pct_txt <chr> "39.79%", "55.40%", "67.42%", "77.25%", "86.55%", "91.…
#> $ ou_threshold    <fct> Under, Under, Under, Under, Over, Over, Over, Over, Ov…

pca_list$pca_variance_scree_plt
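
The variance table in this section follows from the component variances: each variance is divided by the total, accumulated, and compared with .threshold. Below is a base R sketch of that calculation, again with USArrests as placeholder data. The Under/Over rule is inferred from the output above (0.7725 is Under and 0.8655 is Over at a threshold of 0.8) and is an assumption about the function's behavior, not its actual code.

```r
pca_fit   <- prcomp(USArrests, center = TRUE, scale. = TRUE)
threshold <- 0.8

variances     <- pca_fit$sdev^2
var_explained <- variances / sum(variances)
cum_var_pct   <- cumsum(var_explained)

variance_df <- data.frame(
  PC            = paste0("PC", seq_along(variances)),
  var_explained = var_explained,
  var_pct_txt   = sprintf("%.2f%%", 100 * var_explained),
  cum_var_pct   = cum_var_pct,
  # Assumed rule: "Under" until the cumulative share passes the threshold.
  ou_threshold  = factor(ifelse(cum_var_pct < threshold, "Under", "Over"))
)

# Smallest number of components whose cumulative share reaches the threshold.
n_keep <- which(cum_var_pct >= threshold)[1]

variance_df
n_keep
```

In the example above, the first component marked Over is PC5, which matches the five PC columns in the juiced and baked data.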

Variable Loading Plots

pca_list$pca_loadings_plt


pca_list$pca_top_n_loadings_plt
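
The .top_n argument passed earlier (5 in this example) appears to control how many of the largest loadings are shown per component in the second plot. As an assumption about what such a plot summarizes, and not the function's actual code, ranking loadings by absolute value might look like this in base R:

```r
pca_fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)
top_n   <- 2   # hypothetical value; the example above used .top_n = 5

# For each component, keep the variables with the largest absolute loadings.
top_loadings <- lapply(colnames(pca_fit$rotation), function(pc) {
  loadings <- pca_fit$rotation[, pc]
  keep     <- order(abs(loadings), decreasing = TRUE)[seq_len(top_n)]
  data.frame(
    component = pc,
    terms     = names(loadings)[keep],
    value     = unname(loadings[keep])
  )
})

top_loadings_df <- do.call(rbind, top_loadings)
top_loadings_df
```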

Getting Started with healthyR.ai

A Quick Introduction

Steven P. Sanderson II, MPH

2025-04-22