---
title: "Regression Examples"
author: "Josie Athens"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    number_sections: yes
    fig_height: 4
    fig_width: 5
vignette: >
  %\VignetteIndexEntry{Regression Examples}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

This vignette aims to illustrate the use of `pubh` functions for typical regression analysis in Public Health. In particular, the vignette shows the use of the following functions from `pubh`:

1. `cosm_reg` for reporting tables of coefficients.
2. `cross_tbl` for reporting tables of descriptive statistics by exposure of interest.
3. `multiple` for adjusting confidence intervals and p-values for post-hoc analysis.
4. `get_r2` for accessing $R^2$ or pseudo-$R^2$ from regression models.
5. `glm_coef` for some special cases of regression models.

The advantages and limitations of `glm_coef` are:

1. Recognises the main models used in epidemiology/public health.
2. Automatically back-transforms estimates and confidence intervals, when the model requires it.
3. Can use robust standard errors for the calculation of confidence intervals.
    - Standard errors are used by default.
    - The use of standard errors is restricted by the following classes of objects (models): `gee`, `glm` and `survreg`.
4. Can display nice labels for the names of the parameters.
5. Returns a data frame that can be modified and/or exported as tables for publications (with further editing).

We start by loading relevant packages and setting the theme for the plots (as suggested in the Template of this package):

```{r message=FALSE, results = 'hide'}
rm(list = ls())
library(dplyr)
library(rstatix)
library(easystats)
library(pubh)
library(sjlabelled)

theme_set(see::theme_lucid(base_size = 10))
theme_update(legend.position = "top")
options('huxtable.knit_print_df' = FALSE)
options('huxtable.autoformat_number_format' = list(numeric = "%5.2f"))
knitr::opts_chunk$set(comment = NA)
```

# Multiple Linear Regression

There is no need to exponentiate the results for continuous outcomes unless the outcome fits in the log scale. In our first example, we want to estimate the effect of smoking and race on babies' birth weights.

We can generate factors and assign labels in the same `pipe` stream:

```{r}
data(birthwt, package = "MASS")
birthwt <- birthwt |>
  mutate(
    smoke = factor(smoke, labels = c("Non-smoker", "Smoker")),
    race = factor(race, labels = c("White", "African American", "Other"))
    ) |>
  var_labels(
    bwt = 'Birth weight (g)',
    smoke = 'Smoking status',
    race = 'Race'
    )
```

Is good to start with some basic descriptive statistics, so we can compare the birth weight between groups.

Graphical analysis:

```{r}
birthwt |>
  box_plot(bwt ~ smoke, fill = ~ race)
```

Another way to compare the means between the groups is with `gen_bst_df`, which estimates the means of corresponding bootstrapped CIs.

```{r}
birthwt |>
  gen_bst_df(bwt ~ race|smoke) |>
  as_hux() |> theme_pubh()
```

We fit a linear model.

```{r}
model_norm <- lm(bwt ~ smoke + race, data = birthwt)
```

> **Note:** Model diagnostics are not be discussed in this vignette.

Traditional output from the model:

```{r}
model_norm |> Anova()
```

```{r}
model_norm |> parameters()
```

```{r}
model_norm |> performance()
```

Table of coefficients for publication:

```{r}
model_norm |>
  tbl_regression() |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_norm), font_size = 9)
```

```{r}
model_norm |> 
  glm_coef(labels = model_labels(model_norm)) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(get_r2(model_norm), font_size = 9)
```

Function `glm_coef` allows the use of robust standard errors.

```{r}
model_norm |>
  glm_coef(se_rob = TRUE, labels = model_labels(model_norm)) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(paste(
    get_r2(model_norm), "\n",
    "CIs and p-values estimated with robust standard errors."),
    font_size = 9)
```

We can use functions from `easystats`: `estimate_means` calculates the mean birth weight for each combination with corresponding confidence intervals.

```{r}
model_norm |>
  estimate_means(c("race", "smoke"))
```

We can plot the output from `estimate_means` using a recipe from `see`:

```{r}
model_norm |> 
  estimate_means(by = c("race", "smoke")) |> 
  plot() |> 
  gf_labs(
    x = "", y = "Birth weight (g)", title = "",
    col = "Smoking status"
  )
```

We have more flexibility and construct a more tidy plot too:

```{r}
model_norm |>
  estimate_means(c("race", "smoke")) |>
  gf_point(Mean ~ race) |>
  gf_errorbar(CI_low + CI_high ~ race, width = 0.3) |>
  gf_facet_wrap(~smoke) |>
  gf_labs(
    x = "Race",
    y = "Birth weight (g)"
  )
```


When the explanatory variables are categorical, another option is `emmip` from the `emmeans` package. We can include CIs in `emmip` but as estimates are *connected*, the resulting plots look messy, so I recommend  `emmip` to look at the trace.

```{r}
emmip(model_norm, smoke ~ race) |>
  gf_labs(y = get_label(birthwt$bwt), x = "", col = "Smoking status")
```

# Logistic Regression

For logistic regression, we are interested in odds ratios. We will examine the effect of fibre intake on the development of coronary heart disease.

```{r}
data(diet, package = "Epi")
diet <- diet |>
  mutate(
    chd = factor(chd, labels = c("No CHD", "CHD"))
  ) |>
  var_labels(
    chd = "Coronary Heart Disease",
    fibre = "Fibre intake (10 g/day)"
    )
```

We start with descriptive statistics:

```{r}
diet |> estat(~ fibre|chd) |>
  as_hux() |> theme_pubh()
```

```{r}
diet |> na.omit() |>
  copy_labels(diet) |>
  box_plot(fibre ~ chd)
```

We fit a linear logistic model:

```{r}
model_binom <- glm(chd ~ fibre, data = diet, family = binomial)
```

Summary:

```{r}
model_binom |> parameters(exponentiate = TRUE)
```

```{r}
model_binom |> performance()
```

Table of coefficients for publication:

```{r}
model_binom |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_binom), font_size = 9)
```

Effect plot:

```{r}
model_binom |>
  estimate_means(by = "fibre") |>
  gf_line(Probability ~ fibre) |>
  gf_ribbon(CI_low + CI_high ~ fibre) |>
  gf_labs(
    x = "Fibre intake (10 g/day)",
    y = "Probability of CHD"
  )
```

## Matched Case-Control Studies: Conditional Logistic Regression

We will look at a matched case-control study on the effect of oestrogen use and history of gall bladder disease on the development of endometrial cancer.

```{r}
data(bdendo, package = "Epi") 
bdendo <- bdendo |>
  mutate(
    cancer = factor(d, labels = c('Control', 'Case')),
    gall = factor(gall, labels = c("No GBD", "GBD")),
    est = factor(est, labels = c("No oestrogen", "Oestrogen"))
  ) |>
  var_labels(
    cancer = 'Endometrial cancer',
    gall = 'Gall bladder disease',
    est = 'Oestrogen'
  )
```

We start with descriptive statistics:

```{r}
bdendo |> 
  mutate(
    cancer = relevel(cancer, ref = "Case"),
    gall = relevel(gall, ref = "GBD"),
    est = relevel(est, ref = "Oestrogen")
  ) |>
  copy_labels(bdendo) |>
  select(cancer, gall, est) |> 
  tbl_strata(
    strata = est,
    .tbl_fun = ~ .x |>
      tbl_summary(by = gall)
  ) |> 
  cosm_sum(bold = TRUE, head_label = " ") |> 
  theme_pubh(2) |> 
  set_align(1, everywhere, "center")
```

```{r}
bdendo |>
  gf_percents(~ cancer|gall, fill = ~ est, position = "dodge", alpha = 0.6) |>
  gf_labs(
    y = "Percent",
    x = "",
    fill = ""
  )
```

We fit the conditional logistic model:

```{r}
require(survival, quietly = TRUE)
model_clogit <- clogit(cancer == 'Case' ~ est * gall + strata(set), data = bdendo)
```

```{r}
model_clogit |>
  glm_coef(labels = c("Oestrogen/No oestrogen", "GBD/No GBD",
                      "Oestrogen:GBD Interaction")) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(get_r2(model_clogit), font_size = 9)
```

## Ordinal Logistic Regression

We use data about house satisfaction.

```{r message=FALSE}
#require(MASS, quietly = TRUE)
data(housing, package = "MASS")
housing <- housing |>
  var_labels(
    Sat = "Satisfaction",
    Infl = "Perceived influence",
    Type = "Type of rental",
    Cont = "Contact"
    )
```

We fit the ordinal logistic model:

```{r}
model_ord <- MASS::polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing, Hess = TRUE)
```

```{r}
model_ord |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_ord), font_size = 9)
```

Effect plots:

```{r}
model_ord |>
  estimate_means(by = c("Infl", "Cont", "Type")) |>
  mutate(
    Mean = inv_logit(Mean),
    CI_low = inv_logit(CI_low),
    CI_high = inv_logit(CI_high)
  ) |>
  gf_errorbar(CI_low + CI_high ~ Infl, width = 0.2) |>
  gf_point(Mean ~ Infl, size = 0.7) |>
  gf_facet_grid(~ Type + Cont) |>
  gf_labs(
    x = "Perceived influence",
    y = "Satisfaction"
  )
```

```{r}
emmip(model_ord, Infl ~ Type |Cont) |>
  gf_labs(x = "Type of rental", col = "Perceived influence")
```

# Poisson Regression

For Poisson regression, we are interested in incidence rate ratios. We will examine the effect of sex, ethnicity, and age group on the number of absent days from school in a year.

```{r}
data(quine, package = "MASS")
levels(quine$Eth) <- c("Aboriginal", "White")
levels(quine$Sex) <- c("Female", "Male")
```

```{r}
quine <- quine |>
  var_labels(
    Days = "Number of absent days",
    Eth = "Ethnicity",
    Age = "Age group"
    )
```

Descriptive statistics:

```{r}
quine |> 
  cross_tbl(by = "Eth") |> 
  theme_pubh(2) |> 
  add_footnote("n (%); Median (IQR)", font_size = 9)
```


```{r}
quine |>
  box_plot(Days ~ Age|Sex, fill = ~ Eth)
```

We start by fitting a standard Poisson linear regression model:

```{r}
model_pois <- glm(Days ~ Eth + Sex + Age, family = poisson, data = quine)
```

```{r}
model_pois |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_pois), font_size = 9)
```

```{r}
model_pois |> performance()
```

## Negative-binomial

The assumption is that the mean is equal to the variance. If that is the case, deviance should be close to the degrees of freedom of the residuals. We can check for overdispersion:

```{r}
model_pois |> check_overdispersion()
```

Thus, we have overdispersion. One option is to use a negative binomial distribution.

```{r}
model_negbin <- MASS::glm.nb(Days ~ Eth + Sex + Age, data = quine)
```

```{r}
model_negbin |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_negbin), font_size = 9)
```

```{r}
model_negbin |> performance()
```

Effect plot:

```{r}
model_negbin |>
  estimate_means(by = c("Age", "Eth", "Sex")) |>
  gf_errorbar(CI_low + CI_high ~ Age, width = 0.2) |>
  gf_point(Mean ~ Age, size = 1) |>
  gf_facet_wrap(~ Eth + Sex) |>
  gf_labs(
    x = "Age group",
    y = "Number of absent days"
  )
```

```{r}
emmip(model_negbin, Eth ~ Age|Sex) |>
  gf_labs(y = "Number of absent days", x = "Age group", col = "Ethnicity")
```

## Adjusting CIs and p-values for multiple comparisons

We adjust for multiple comparisons:

```{r}
multiple(model_negbin, ~ Age|Eth)$df
```

We can see the comparison graphically with:

```{r}
multiple(model_negbin, ~ Age|Eth)$fig_ci |>
  gf_labs(x = "IRR")
```


# Survival Analysis

We will use an example on the effect of thiotepa versus placebo on the development of bladder cancer.

```{r}
data(bladder)
bladder <- bladder |>
  mutate(times = stop,
         rx = factor(rx, labels=c("Placebo", "Thiotepa"))
         ) |>
  var_labels(times = "Survival time",
             rx = "Treatment")
```

## Parametric method

```{r}
model_surv <- survreg(Surv(times, event) ~ rx, data = bladder)
```

Using robust standard errors:

```{r}
model_surv |>
  glm_coef(labels = c("Treatment: Thiotepa/Placebo", "Scale"), se_rob = TRUE) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh(1) |>
  add_footnote(get_r2(model_surv), font_size = 9)
```

In this example, the scale parameter is not statistically different from one, meaning the hazard is constant, and thus, we can use the exponential distribution:

```{r}
model_exp <- survreg(Surv(times, event) ~ rx, data = bladder, dist = "exponential")
```

```{r}
model_exp |>
  glm_coef(labels = c("Treatment: Thiotepa/Placebo"), se_rob = TRUE) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(get_r2(model_exp), font_size = 9)
```

> **Interpretation:** Patients receiving Thiotepa live on average 64% more than those in the Placebo group.

Using naive standard errors:

```{r}
model_exp |>
  glm_coef(labels = c("Treatment: Thiotepa/Placebo")) |>
  as_hux() |> set_align(everywhere, 2:3, "right") |>
  theme_pubh() |>
  add_footnote(get_r2(model_exp), font_size = 9)
```


```{r}
model_exp |>
  estimate_means(by = "rx") |>
  gf_errorbar(CI_low + CI_high ~ rx, width = 0.3) |>
  gf_point(Mean ~ rx, size = 1) |>
  gf_labs(
    x = "Treatment",
    y = "Survival time"
  )
```


## Cox proportional hazards regression

```{r}
model_cox <-  coxph(Surv(times, event) ~ rx, data = bladder)
```

```{r}
model_cox |> 
  tbl_regression(exponentiate = TRUE) |> 
  cosm_reg() |> theme_pubh() |> 
  add_footnote(get_r2(model_cox), font_size = 9)
```

```{r}
model_cox |>
  estimate_means(by = "rx")
```

> **Interpretation:** Patients receiving Thiotepa are 64% less likely of dying than those in the Placebo group.