---
title: "Introduction to summarytabl"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
description: >
  This document introduces you to some of summarytabl's most frequently used functions, and demonstrates how you can use them with data frames.
vignette: >
  %\VignetteIndexEntry{Introduction to summarytabl}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Overview

**summarytabl** is an R package designed to simplify the creation of summary tables for different types of data. It provides a set of functions that help you quickly describe:

* Categorical variables
* Multiple response variables
* Continuous variables

Each function is clearly prefixed based on the type of data it summarizes, making it easy to identify and apply the right tool for your analysis.

Use these functions to summarize binary and nominal variables:

* `cat_tbl()` creates a summary table for a categorical variable.
* `cat_group_tbl()` summarizes two categorical variables.

These functions are ideal for summarizing binary, ordinal, and Likert-scale variables in which respondents select one response per statement, question, or item:

* `select_tbl()` summarizes multiple response and ordinal variables.
* `select_group_tbl()` summarizes multiple response and ordinal variables by a group or pattern.

For interval and ratio-level variables, use:

* `mean_tbl()` generates summary statistics for continuous variables.
* `mean_group_tbl()` generates summary statistics for continuous variables by group or pattern.

All functions work with data frames and tibbles, and each returns a tibble as output.

This document is organized into three sections, each focusing on a different set of functions for summarizing a specific type of variable.

To begin working with **summarytabl**, load the package:

```{r setup}
library(summarytabl)
```

Keep reading to learn more about how each function works, or jump to the section that matches the type of variable or data you're working with.

## Working with categorical variables

Let's explore how to use `cat_tbl()` and `cat_group_tbl()` to summarize categorical variables. We'll begin by summarizing a single categorical variable, `race`, from the `nlsy` dataset.

```{r}
cat_tbl(data = nlsy, var = "race")
```

The function returns a tibble with three columns by default: 

* `race`: the categories of the variable being summarized (the column is named after the variable)
* `count`: the number of observations in each category of `race`
* `percent`: the percentage of observations in each category of `race`, calculated relative to the total
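
The `count` and `percent` columns correspond to a one-way frequency table. As a rough point of comparison, the base R sketch below builds the same kind of summary for a small made-up vector (`toy_race` is hypothetical and not part of the package). It assumes `percent` is a proportion of the total; the internals of `cat_tbl()` may well differ.

```{r}
# Hypothetical toy data; not the nlsy dataset
toy_race <- c("Black", "Hispanic", "Black", "Hispanic", "Black")

# One-way frequency table, then proportions relative to the total
toy_counts <- table(toy_race)
toy_summary <- data.frame(
  race    = names(toy_counts),
  count   = as.vector(toy_counts),
  percent = as.vector(prop.table(toy_counts))
)
toy_summary
```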

You can exclude certain values and eliminate missing values from the data using the `ignore` and `na.rm` arguments, respectively.

```{r}
cat_tbl(data = nlsy, 
        var = "race",
        ignore = "Hispanic",
        na.rm = TRUE)
```

Suppose we want to create a contingency table to summarize two categorical variables. We can do this using the `cat_group_tbl()` function. In this example, we summarize `race` by `bthwht`. Before applying `cat_group_tbl()`, we'll recode the values of `bthwht`, changing `0` to `regular_birthweight` and `1` to `low_birthweight`.

```{r}
nlsy_cross_tab <- 
  nlsy |>
  dplyr::select(c(race, bthwht)) |>
  dplyr::mutate(bthwht = ifelse(bthwht == 0, "regular_birthweight", "low_birthweight")) 

cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht")
```

The function returns a tibble with four columns by default: 

* `race`: the categories of `race`, the `row_var` variable
* `bthwht`: the categories of `bthwht`, the `col_var` variable
* `count`: the number of observations for each combination of `race` and `bthwht` categories
* `percent`: the percentage of observations for each combination of `race` and `bthwht` categories, calculated relative to the total

To pivot the output to the wide format, set `pivot = "wider"`. 

```{r}
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              pivot = "wider")
```

To display only percentages, set `only = "percent"`. You can also control how those percentages are calculated and displayed using the `margins` argument.

```{r}
# Default: percentages across the full table sum to one
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              pivot = "wider",
              only = "percent")

# Rowwise: percentages sum to one across columns within each row
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              margins = "rows",
              pivot = "wider",
              only = "percent")

# Columnwise: percentages within each column sum to one
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              margins = "columns",
              pivot = "wider",
              only = "percent")
```  
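
The three `margins` settings mirror the way proportions are computed from a two-way table in base R. The sketch below uses `prop.table()` on made-up data (`toy_births` is hypothetical) to show the full-table, row-wise, and column-wise calculations. It illustrates the arithmetic only and is not necessarily how `cat_group_tbl()` is implemented.

```{r}
# Hypothetical toy data; not the nlsy dataset
toy_births <- data.frame(
  race   = c("Black", "Black", "Hispanic", "Hispanic", "Hispanic"),
  bthwht = c("low", "regular", "regular", "regular", "low")
)
toy_tab <- table(toy_births$race, toy_births$bthwht)

prop_all  <- prop.table(toy_tab)             # whole table sums to one
prop_rows <- prop.table(toy_tab, margin = 1) # each row sums to one
prop_cols <- prop.table(toy_tab, margin = 2) # each column sums to one
prop_rows
```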

Sometimes, you may want to exclude specific values from your analysis. To do this, pass a named vector or list to the `ignore` argument, specifying which values to exclude from the `row_var` and `col_var` variables. In the example below, the `Non-Black,Non-Hispanic` category is excluded from the `race` variable (i.e., `row_var`). To ensure that NAs are not returned in the final table, `na.rm.row_var` is set to `TRUE`.

```{r}
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = c(race = "Non-Black,Non-Hispanic"))
```    

When you need to exclude more than one value from `row_var` or `col_var`, use a named list. In the example below, both the `Non-Black,Non-Hispanic` and `Hispanic` categories are excluded from the `race` variable.

```{r}
cat_group_tbl(data = nlsy_cross_tab,
              row_var = "race",
              col_var = "bthwht",
              na.rm.row_var = TRUE,
              ignore = list(race = c("Non-Black,Non-Hispanic", "Hispanic")))
```  

## Working with multiple response and ordinal variables

Next, let's explore how to use the `select_tbl()` and `select_group_tbl()` functions to summarize multiple response and ordinal variables. These variables are commonly used in survey research, psychology, and the health sciences. Examples include symptom checklists, scales like a depression index with multiple items, and questions that allow respondents to select all choices that apply to them.

The `depressive` dataset contains eight variables that share the same variable stem: `dep`, with each one representing a different item used to measure depression.

```{r}
names(depressive)
```

Using the `select_tbl()` function, we can summarize participants' responses to these items by showing how many respondents chose each answer option (i.e., value) for every variable.

```{r}
select_tbl(data = depressive, var_stem = "dep")
```

Alternatively, you can choose to summarize specific variables by passing their names to the `var_stem` argument and setting the `var_input` argument to `"name"`.

```{r}
select_tbl(data = depressive, 
           var_stem = c("dep_1", "dep_4", "dep_6"),
           var_input = "name")
```

By default, missing values are removed using listwise deletion. To switch to pairwise deletion instead, set `na_removal = "pairwise"`. 

```{r}
select_tbl(data = depressive, 
           var_stem = "dep",
           na_removal = "pairwise")
```
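
The difference between the two deletion strategies is easiest to see on a small example. In this base R sketch (`toy_items` is made up), listwise deletion keeps only the rows that are complete on every item, whereas pairwise deletion drops missing values one variable at a time. This follows the usual definitions of the terms; the package's exact handling is assumed to match.

```{r}
# Hypothetical toy data; not the depressive dataset
toy_items <- data.frame(
  dep_1 = c(1, 2, NA, 3),
  dep_2 = c(2, NA, 1, 3)
)

# Listwise: a row is dropped if any item is missing
listwise_n <- sum(complete.cases(toy_items))

# Pairwise: each item keeps all of its own non-missing values
pairwise_n <- colSums(!is.na(toy_items))

listwise_n
pairwise_n
```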

To display the output in the wide format, set `pivot = "wider"`.

```{r}
select_tbl(data = depressive, 
           var_stem = "dep",
           na_removal = "pairwise",
           pivot = "wider")
```

It's common practice to group multiple response or ordinal variables by another variable. This type of descriptive analysis allows for meaningful comparisons across different segments of your dataset. With `select_group_tbl()`, you can create a summary table for multiple response and ordinal variables, grouped either by another variable in your dataset or by matching a pattern in the variable names. For example, we often want to summarize survey responses by race.

First, recode the `race` variable and the values for each of the eight depressive index variables in the `depressive` dataset, replacing numeric categories with descriptive string labels for easier interpretation.

```{r}
dep_recoded <- 
  depressive |>
  dplyr::mutate(
    race = dplyr::case_match(.x = race,
                             1 ~ "Hispanic", 
                             2 ~ "Black", 
                             3 ~ "Non-Black/Non-Hispanic",
                             .default = NA)
  ) |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("dep"),
      .fns = ~ dplyr::case_when(.x == 1 ~ "often", 
                                .x == 2 ~ "sometimes", 
                                .x == 3 ~ "hardly ever")
    ))
```  

Next, use the `select_group_tbl()` function to summarize responses for all eight variables by `race`:

```{r}
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race")
```

As with `select_tbl()`, setting the `pivot` argument to `"wider"` reshapes the table into the wide format, while setting the `na_removal` argument to `"pairwise"` handles missing values through pairwise deletion.

```{r}
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider")
```

The `ignore` argument can be used to exclude specific values from the analysis. In the example below, the value `often` is removed from all eight depression index variables, and the `Non-Black/Non-Hispanic` category is excluded from the `race` variable.

```{r}
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider",
                 ignore = c(dep = "often", race = "Non-Black/Non-Hispanic"))
```    

When `group_type` is set to `"variable"` (the default), the `margins` argument controls how percentages are calculated and presented.

```{r}
# Default: percentages across each variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 na_removal = "pairwise",
                 pivot = "wider")

# Rowwise: for each value of the variable, the percentages 
# across all levels of the grouping variable sum to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 margins = "rows",
                 na_removal = "pairwise",
                 pivot = "wider")

# Columnwise: for each level of the grouping variable, 
# the percentages across all values of the variable sum 
# to one
select_group_tbl(data = dep_recoded, 
                 var_stem = "dep",
                 group = "race",
                 margins = "columns",
                 na_removal = "pairwise",
                 pivot = "wider")
```  

Another way to use `select_group_tbl()` is to summarize responses that match a specific pattern, such as survey waves or time points. To enable this feature, set `group_type = "pattern"` and provide the desired pattern in the `group` argument. For example, the `stem_social_psych` dataset contains variables that capture student responses about their sense of belonging in the STEM community at two distinct time points: "w1" and "w2". You can summarize these responses using a pattern-based approach, where the time points serve as the groups.

```{r}
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern")
``` 
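
To see what the pattern is doing, the base R sketch below applies the same regular expression to two variable names from `stem_social_psych`. It assumes the stem is matched as a prefix and that the group value is the first substring matching `group`; the package's own matching rules may be stricter or looser.

```{r}
# Two variable names sharing the "belong_belong" stem
toy_names <- c("belong_belongStem_w1", "belong_belongStem_w2")

# Keep names that start with the stem
stem_hits <- toy_names[startsWith(toy_names, "belong_belong")]

# Pull out the first substring matching the group pattern
wave <- regmatches(stem_hits, regexpr("_w\\d", stem_hits))
wave
```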

Use the `group_name` argument to assign a descriptive name to the column containing the matched pattern values.

```{r}
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave")
``` 

You can also include variable labels in your summary table by using the `var_labels` argument.

```{r}
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 var_labels = c(
                   belong_belongStem_w1 = "I feel like I belong in STEM (wave 1)",
                   belong_belongStem_w2 = "I feel like I belong in STEM (wave 2)"
                 ))
```

Finally, use the `only` argument to choose what information to return.

```{r}
# Default: counts and percentages
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave")

# Counts only
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 only = "count")

# Percentages only
select_group_tbl(data = stem_social_psych, 
                 var_stem = "belong_belong",
                 group = "_w\\d",
                 group_type = "pattern",
                 group_name = "wave",
                 only = "percent")
```

## Working with continuous variables

Finally, let's look at how to use the `mean_tbl()` and `mean_group_tbl()` functions to summarize continuous variables. The `mean_tbl()` function generates descriptive statistics either for a set of continuous variables that share a common stem or for individual continuous variables. The resulting summary table includes the mean, standard deviation, minimum value, maximum value, and count of non-missing observations for each variable.

The `sdoh` dataset contains six variables describing characteristics of health care facilities, all of which begin with the prefix `HHC_PCT`. Using the `mean_tbl()` function, you can generate summary statistics for these variables:

```{r}
mean_tbl(data = sdoh, var_stem = "HHC_PCT")
``` 
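
For orientation, the statistics reported for each variable can be reproduced by hand. The base R sketch below computes them for one made-up column (`toy_pct` is hypothetical), assuming the standard deviation is the usual sample standard deviation returned by `sd()`.

```{r}
# Hypothetical toy values; not taken from the sdoh dataset
toy_pct <- c(10, 20, NA, 40)

toy_stats <- c(
  mean = mean(toy_pct, na.rm = TRUE),
  sd   = sd(toy_pct, na.rm = TRUE),
  min  = min(toy_pct, na.rm = TRUE),
  max  = max(toy_pct, na.rm = TRUE),
  nobs = sum(!is.na(toy_pct))  # count of non-missing observations
)
toy_stats
```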

Alternatively, if you want to generate summary statistics for only a subset of those variables, you can specify their names directly in the `var_stem` argument and set `var_input = "name"` to indicate you're referencing variable names rather than a shared stem.

```{r}
mean_tbl(
  data = sdoh,
  var_stem = c("HHC_PCT_HHA_PHYS_THERAPY",
               "HHC_PCT_HHA_OCC_THERAPY",
               "HHC_PCT_HHA_SPEECH"),
  var_input = "name"
)
``` 

You can also specify how missing values are removed, using the `na_removal` argument.

```{r}
# Default listwise removal
mean_tbl(data = sdoh, var_stem = "HHC_PCT")

# Pairwise removal
mean_tbl(data = sdoh, 
         var_stem = "HHC_PCT",
         na_removal = "pairwise")
``` 

Consider adding variable labels using the `var_labels` argument to help make the variable names easier to interpret.

```{r}
mean_tbl(data = sdoh, 
         var_stem = "HHC_PCT",
         na_removal = "pairwise",
         var_labels = c(
           HHC_PCT_HHA_NURSING="% agencies offering nursing care services",
           HHC_PCT_HHA_PHYS_THERAPY="% agencies offering physical therapy services",
           HHC_PCT_HHA_OCC_THERAPY="% agencies offering occupational therapy services",
           HHC_PCT_HHA_SPEECH="% agencies offering speech pathology services",
           HHC_PCT_HHA_MEDICAL="% agencies offering medical social services",
           HHC_PCT_HHA_AIDE="% agencies offering home health aide services"
         ))
``` 

Similar to working with multiple response variables, it's common practice to group continuous variables by another variable to enable meaningful comparisons across different segments of a dataset. The `mean_group_tbl()` function facilitates this type of descriptive analysis by generating summary statistics for continuous variables, grouped either by a specific variable in the dataset or by matching patterns in variable names. For example, it's often useful to present summary statistics by demographic categories such as region, gender, age, or race.

```{r}
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               group_type = "variable")
``` 

You can control which values to exclude and how missing data is handled using the `ignore` and `na_removal` arguments. To specify values to ignore, use a named vector or list, where each name corresponds to a variable stem or specific variable name.

```{r}
# Default listwise removal
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

# Pairwise removal
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               na_removal = "pairwise",
               ignore = c(HHC_PCT = 0, REGION = "Northeast"))

# Pairwise removal excluding several values from the same stem 
# or group variable.
mean_group_tbl(data = sdoh, 
               var_stem = "HHC_PCT",
               group = "REGION",
               na_removal = "pairwise",
               ignore = list(HHC_PCT = 0, REGION = c("Northeast", "South")))
``` 
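
One reason to prefer a named list when mixing value types: base R's `c()` coerces all elements to a common type, so a numeric `0` combined with a string becomes the string `"0"`. The sketch below shows this with base R alone. How **summarytabl** compares ignored values against the data is not shown here, so treat the list form as a cautious default rather than a requirement.

```{r}
ignore_vec  <- c(HHC_PCT = 0, REGION = "Northeast")
ignore_list <- list(HHC_PCT = 0, REGION = c("Northeast", "South"))

class(ignore_vec)           # all elements share one type
class(ignore_list$HHC_PCT)  # each element keeps its own type
```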

Another way to use `mean_group_tbl()` is to summarize responses based on a shared pattern, such as survey time points. To enable this feature, set `group_type = "pattern"` and specify the desired pattern in the `group` argument. 

Consider a dataset compiled by researchers examining the number of symptoms participants reported after a long illness. In this (fictitious) dataset, responses are collected at three time points: "t1" (baseline), "t2" (6-month follow-up), and "t3" (one-year follow-up). Using a pattern-based approach, you can group variables by these time points to generate summary statistics for each phase of data collection.

In the example below, we first create the `symptoms_data` dataset and then use `mean_group_tbl()` to generate summary statistics for variables that begin with the prefix `symptoms` and contain a substring matching the pattern `"_t\\d"`: an underscore, followed by the letter "t" and a single digit, which identifies the time point. The `ignore` argument excludes the value `-999` from the analysis.

```{r}
set.seed(0803)
symptoms_data <-
  data.frame(
    symptoms_t1 = sample(c(0:10, -999), replace = TRUE, size = 50),
    symptoms_t2 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50),
    symptoms_t3 = sample(c(NA, 0:10, -999), replace = TRUE, size = 50)
  )

mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               ignore = c(symptoms = -999))
```
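
Conceptually, ignoring a sentinel value such as `-999` amounts to treating it as missing before the statistics are computed. The base R sketch below shows that idea on a made-up vector (`toy_symptoms` is hypothetical); the package's internal approach is assumed here, not confirmed.

```{r}
# Hypothetical toy values with a -999 sentinel and a true NA
toy_symptoms <- c(2, -999, 4, NA, 6)

# Recode the sentinel to NA, then summarize the remaining values
cleaned <- replace(toy_symptoms, toy_symptoms %in% -999, NA)
mean(cleaned, na.rm = TRUE)
sum(!is.na(cleaned))
```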

To make your output easier to understand, use the `group_name` argument to add a label to the column that shows grouping values or matched patterns. You can also use the `var_labels` argument to display descriptive labels for each variable.

```{r}
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999), 
               var_labels = c(symptoms_t1 = "# of symptoms at baseline",
                              symptoms_t2 = "# of symptoms at 6-month follow-up",
                              symptoms_t3 = "# of symptoms at one-year follow-up"))
``` 

Finally, you can choose what information to return using the `only` argument.

```{r}
# Default: all summary statistics returned
# (mean, sd, min, max, nobs)
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999))

# Means and non-missing observations only
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "nobs"))

# Means and standard deviations only
mean_group_tbl(data = symptoms_data, 
               var_stem = "symptoms",
               group = "_t\\d",
               group_type = "pattern",
               group_name = "time_point",
               ignore = c(symptoms = -999),
               only = c("mean", "sd"))
```