---
title: "Additional Predictor with Maximum Effect Size"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{intro}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(knitr)
opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Introduction

This vignette of package **`maxEff`** ([Github](https://github.com/tingtingzhan/maxEff),
[RPubs](https://rpubs.com/tingtingzhan/maxEff)) documents three methods of selecting one from many `numeric` predictors for a regression model, to ensure that the additional predictor has the maximum effect size.   The three methods are implemented in functions `add_num()`, `add_dummy()` and `add_dummy_partition()`.


## Note to Users

Examples in this vignette require that the `search` path has

```{r setup}
library(maxEff)
library(groupedHyperframe)
library(survival)
library(rpart)
```

Users should remove parameter `mc.cores = 1L` from all examples and use the default option, which engages all CPU cores on the current host for macOS. The authors are forced to have `mc.cores = 1L` in this vignette in order to pass `CRAN`'s submission check.


## Terms and Abbreviations

```{r echo = FALSE, results = 'asis'}
c(
  '', 'Forward pipe operator', '`?base::pipeOp` introduced in `R` 4.1.0', 
  '`abs`', 'Absolute value', '`base::abs`',
  '`coxph`', 'Cox model', '`survival::coxph`',
  '`CRAN`, `R`', 'The Comprehensive R Archive Network', 'https://cran.r-project.org',
  '`factor`', 'Factor, or categorical variable', '`base::factor`',
  '`function`', '`R` function', '``base::`function` ``',
  '`groupedHyperframe`', 'Grouped hyper data frame', ' `groupedHyperframe::as.groupedHyperframe`',
  '`head`', 'First parts of an object', '`utils::head`; `utils:::head.default`',
  '`hypercolumns`, `hyperframe`', '(Hyper columns of) hyper data frame', '`spatstat.geom::hyperframe`',
  '`labels`', 'Labels from object', '`base::labels`; `maxEff::labels.node1`',
  '`levels`', 'Levels of a `factor`', '`base::levels`',
  '`listof`', 'List of objects', '`stats::listof`',
  '`logistic`', 'Logistic regression model', '`stats::glm(., family = binomial(\'logit\'))`',
  '`matrix`', 'Matrix', '`base::matrix`',
  '`partition`', 'Stratified partition', '`maxEff::statusPartition`, `caret::createDataPartition`', 
  '`PFS`', 'Progression/recurrence free survival', 'https://en.wikipedia.org/wiki/Progression-free_survival',
  '`predict`', 'Model prediction', '`stats::predict`; `maxEff::predict.add_num`; `maxEff::predict.add_dummy`',
  '`quantile`', 'Quantile', '`stats::quantile`',
  '`rpart`', 'Recursive partitioning and regression trees', '`rpart::rpart`',
  '`S3`, `generic`, `methods`', '`S3` object oriented system',  '`base::UseMethod`; `utils::methods`; `utils::getS3method`; https://adv-r.hadley.nz/s3.html',
  '`sort_by`', 'Sort an object by some criterion', '`base::sort_by`; `maxEff::sort_by.add_`',
  '`subset`', 'Subsets of object by conditions', '`base::subset`; `maxEff::subset.add_dummy`',
  '`Surv`', 'Survival object', '`survival::Surv`',
  '`update`', 'Update and re-fit a model call', '`stats::update`'
) |>
  matrix(nrow = 3L, dimnames = list(c('Term / Abbreviation', 'Description', 'Reference'), NULL)) |>
  t.default() |>
  as.data.frame.matrix() |> 
  kable()
```

## Acknowledgement

This work is supported by NCI R01CA222847 ([I. Chervoneva](https://orcid.org/0000-0002-9104-4505), [T. Zhan](https://orcid.org/0000-0001-9971-4844), and [H. Rui](https://orcid.org/0000-0002-8778-261X)) and R01CA253977 (H. Rui and I. Chervoneva). 


# Handy Tools

## `node1()`: Dichotomization via First Node of Recursive Partitioning

Consider the following recursive partitioning and regression example,
```{r}
data(cu.summary, package = 'rpart')
(r = rpart(Price ~ Mileage, data = cu.summary, cp = .Machine$double.eps, maxdepth = 1L))
```

Function `node1()` creates a dichotomizing rule, i.e., a **`function`**, based on the first node of partitioning,

```{r}
(foo = r |> node1())
```


The dichotomizing rule `foo` converts a `numeric` object to a `logical` object.

```{r}
set.seed(125); rnorm(6, mean = 24.5) |> foo()
```

Developers may obtain the `numeric` cutoff value of `foo`, or a brief text of to describe `foo`, for further analysis.

```{r}
foo |> get_cutoff()
foo |> labels()
```


## `statusPartition()`: Stratified Partition by Survival Status

Function `statusPartition()` performs stratified partition based on the survival status of a `Surv` response, to avoid the situation that Cox model is degenerate if all subjects are censored. This is a deviation from the very popular function `caret::createDataPartition()`, which stratifies `Surv` response by the `quantile`s of the survival time.
```{r}
y = (survival::capacitor) |> 
  with(expr = Surv(time, status))
set.seed(15); id = y |>
  statusPartition(times = 1L, p = .5)
table(y[id[[1L]], 2L]) / table(y[,2L]) # balanced by status
set.seed(15); id0 = y |>
  caret::createDataPartition(times = 1L, p = .5)
table(y[id0[[1L]], 2L]) / table(y[,2L]) # *not* balanced by status
```

# Data Preparation

Data set `Ki67` in package **`groupedHyperframe`** is a `groupedHyperframe`. 

```{r}
data(Ki67, package = 'groupedHyperframe')
Ki67
```

```{r}
s = Ki67 |>
  aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01))
```

Candidate of additional predictors are stored in a `numeric`-`hypercolumn` *`logKi67.quantile`*.
Users are encouraged to learn more about `groupedHyperframe` class and function `aggregate_quantile()` from package **`groupedHyperframe`** [vignette](https://rpubs.com/tingtingzhan/groupedHyperframe).


Partition into a training (80%) and test (20%) set.

```{r}
set.seed(234); id = sample.int(n = nrow(s), size = nrow(s)*.8) |> sort.int()
s0 = s[id, , drop = FALSE] # training set
s1 = s[-id, , drop = FALSE] # test set
```

Let's consider a starting model of endpoint `PFS` with predictor `Tstage` on the training set `s0`.

```{r}
summary(m <- coxph(PFS ~ Tstage, data = s0))
```

# Additional `numeric` Predictor with Maximum Effect Size

Function `add_num()` treats each additional predictor as a `numeric` variable, and `update`s the starting model with each additional predictor.  Function `add_num()` returns an `'add_num'` object, which is a `listof` objects with an internal class `'add_num_'`.

The `S3` generic `sort_by()` sorts the `'add_num'` object by the `abs`olute value of the regression coefficient (i.e., effect size) of the additioanal `numeric` predictor.

The `S3` generic `head()` chooses the top `n` element from the object returned from the previous step.

```{r}
set.seed(14837); m1 = m |>
  add_num(x = ~ logKi67.quantile, mc.cores = 1L) |>
  sort_by(y = abs(effsize)) |>
  head(n = 2L)
m1
```

The S3 generic `predict()` uses model `m1` on the test set `s1`.

```{r}
m1[1L] |> predict(newdata = s1)
```


# Additional `logical` Predictor with Maximum Effect Size

## `add_dummy()`: Naive Practice

Function `add_dummy()` partitions each additional `numeric` predictor into a `logical` variable using function `node1()`, and `update`s the starting model by adding in each of the dichotomized `logical` predictor. Function `add_dummy()` returns an `'add_dummy'` object, which is a `listof` `node1` objects.

The `S3` generic `subset()` subsets the the `'add_dummy'` object by the balance of partition of the additional predictor. 

The `S3` generic `sort_by()` sorts the `'add_dummy'` object by the `abs`olute value of regression coefficient (i.e., effect size) of the additional `logical` predictor.

The `S3` generic `head()` chooses the top `n` element from the object returned from the previous step.


```{r}
set.seed(14837); m2 = m |>
  add_dummy(x = ~ logKi67.quantile, mc.cores = 1L) |>
  subset(subset = p1 > .05 & p1 < .95) |> 
  sort_by(y = abs(effsize)) |>
  head(n = 2L)
m2
```

The S3 generic `predict()` uses model `m2` on the test set `s1`.

```{r}
m2[1L] |> predict(newdata = s1)
```


## `add_dummy_partition()`: via Repeated Partitions

Function `add_dummy_partition()` partitions each additional `numeric` predictor into a `logical` variable in the following steps.

1. Generate multiple, i.e., repeated, partitions.
2. For each partition, create a dichotomizing rule (via function `node1()`) on the training set. Apply this dichotomizing rule on the test set and obtain the estimated regression coefficient (i.e., effect size) of the additional `logical` predictor.
3. Among all partitions, select the one with median effect size of the additional `logical` predictor.

Function `add_dummy_partition()` also returns an `'add_dummy'` object.

```{r}
set.seed(14837); m3 = m |> 
  add_dummy_partition(~ logKi67.quantile, times = 20L, mc.cores = 1L) |>
  subset(subset = p1 > .15 & p1 < .85) |>
  sort_by(y = abs(effsize), decreasing = TRUE) |>
  head(n = 2L)
m3
```


```{r}
m3[1L] |> predict(newdata = s1)
```