---
title: "Short Tutorial"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Short Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(ocf)
```

In this tutorial, we show how to use the `ocf` package to estimate and make inference about the conditional choice probabilities and the covariates' marginal effects. Before diving into the code, we provide an overview of the statistical problem at hand.

## Ordered Choice Models

We postulate the existence of a latent and continuous outcome variable $Y_i^*$, assumed to obey the following regression model:

$$ Y_i^* = g \left( X_i \right) + \epsilon_i $$

where $X_i$ consists of a set of raw covariates, $g \left( \cdot \right)$ is a potentially non-linear regression function, and $\epsilon_i$ is independent of $X_i$.

An observational rule links the observed outcome $Y_i$ to the latent outcome $Y_i^*$ through unknown threshold parameters $- \infty = \zeta_0 < \zeta_1 < \dots < \zeta_{M - 1} < \zeta_M = \infty$ that define intervals on the support of $Y_i^*$, with each interval corresponding to one of the $M$ categories or classes of $Y_i$:

$$ \zeta_{m - 1} < Y_i^* \leq \zeta_m \implies Y_i = m, \quad m = 1, \dots, M $$

The statistical targets of interest are the conditional choice probabilities:

$$ p_m \left( X_i \right) := \mathbb{P} \left( Y_i = m | X_i \right) $$

and the marginal effect of the $j$-th covariate on $p_m \left( \cdot \right)$:

$$ \nabla^j p_m \left( x \right) := \begin{cases} \frac{\partial p_m \left( x \right)}{\partial x_j}, & \text{if } x_j \text{ is continuous} \\ p_m \left( \lceil x_j \rceil \right) - p_m \left( \lfloor x_j \rfloor \right), & \text{if } x_j \text{ is discrete} \end{cases} $$

where $x_j$ is the $j$-th element of the vector $x$, and $\lceil x_j \rceil$ and $\lfloor x_j \rfloor$ denote $x$ with its $j$-th element rounded up and down to the closest integer.

## Code

For illustration purposes, we generate a synthetic data set. Details about the employed DGP can be retrieved by running `help(generate_ordered_data)`.

```{r data-generation, eval = TRUE}
## Generate synthetic data.
set.seed(1986)

n <- 100
data <- generate_ordered_data(n)
sample <- data$sample
Y <- sample$Y
X <- sample[, -1]

table(Y)
head(X)
```

### Conditional Probabilities

To estimate the conditional probabilities, the `ocf` function constructs a collection of forests, one for each category of `Y` (three in this case). We can then use the forests to predict out-of-sample via the `predict` method. `predict` returns a matrix of predicted probabilities and a vector of predicted class labels (each observation is assigned to the class with the highest predicted probability).

```{r adaptive-ocf, eval = TRUE}
## Training-test split.
train_idx <- sample(seq_len(length(Y)), floor(length(Y) * 0.5))

Y_tr <- Y[train_idx]
X_tr <- X[train_idx, ]

Y_test <- Y[-train_idx]
X_test <- X[-train_idx, ]

## Fit ocf on training sample. Use default settings.
forests <- ocf(Y_tr, X_tr)

## Summary of data and tuning parameters.
summary(forests)

## Out-of-sample predictions.
predictions <- predict(forests, X_test)

head(predictions$probabilities)
table(Y_test, predictions$classification)
```

To produce consistent and asymptotically normal predictions, we need to set the `honesty` argument to TRUE. This makes the `ocf` function use one part of the training sample to construct the forests and a disjoint part to compute the predictions.
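Conceptually, honesty amounts to splitting the training sample into two disjoint subsamples, one devoted to choosing the tree splits and one devoted to filling the leaves. The following is a minimal sketch of this idea, not `ocf`'s internal code; the half-and-half split is an assumption made purely for exposition.

```{r honesty-sketch, eval = FALSE}
## Illustration only: honesty uses two disjoint halves of the training data.
honesty_idx <- sample(seq_len(length(Y_tr)), floor(length(Y_tr) * 0.5))

X_split <- X_tr[honesty_idx, ]  # used to choose the tree splits
Y_split <- Y_tr[honesty_idx]

X_leaf <- X_tr[-honesty_idx, ]  # held out to fill the leaves,
Y_leaf <- Y_tr[-honesty_idx]    # i.e., to compute the predictions
```

In practice, we only need to set `honesty = TRUE`: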
```{r honest-ocf, eval = TRUE}
## Honest forests.
honest_forests <- ocf(Y_tr, X_tr, honesty = TRUE)

honest_predictions <- predict(honest_forests, X_test)
head(honest_predictions$probabilities)
```

To estimate standard errors for the predicted probabilities, we need to set the `inference` argument to TRUE. However, this works only if `honesty` is TRUE, as the formula for the variance is valid only for honest predictions. Notice that the estimation of the standard errors can considerably slow down the routine. The computation can be sped up by increasing the number of threads used to construct the forests via the `n.threads` argument.

```{r honest-ocf-inference, eval = TRUE}
## Compute standard errors. Do not run.
# honest_forests <- ocf(Y_tr, X_tr, honesty = TRUE, inference = TRUE, n.threads = 0) # Use all CPUs.
# head(honest_forests$predictions$standard.errors)
```

### Covariates' Marginal Effects

To estimate the covariates' marginal effects, we can post-process the conditional probability predictions. This is performed by the `marginal_effects` function, which can estimate mean marginal effects, marginal effects at the mean, and marginal effects at the median, according to the `eval` argument.

```{r adaptive-me, eval = TRUE}
## Marginal effects at the mean.
me_atmean <- marginal_effects(forests, eval = "atmean") # Try also 'eval = "atmedian"' and 'eval = "mean"'.
print(me_atmean) # Try also 'latex = TRUE'.
```

As before, we can set the `inference` argument to TRUE to estimate the standard errors. Again, this requires honest forests and can considerably slow down the routine.

```{r honest-me, eval = TRUE}
## Compute standard errors.
honest_me_atmean <- marginal_effects(honest_forests, eval = "atmean", inference = TRUE)
print(honest_me_atmean) # Try also 'latex = TRUE'.
```
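With the standard errors in hand, we can form conventional large-sample confidence intervals for the marginal effects. Below is a minimal sketch; the element names `marginal.effects` and `standard.errors` are assumptions about the structure of the returned object, so inspect it with `str(honest_me_atmean)` before relying on them.

```{r me-confidence-intervals, eval = FALSE}
## 95% normal-approximation confidence intervals for the marginal effects.
## NOTE: the element names below are assumptions; check str(honest_me_atmean).
est <- honest_me_atmean$marginal.effects
se <- honest_me_atmean$standard.errors

lower <- est - qnorm(0.975) * se
upper <- est + qnorm(0.975) * se

lower
upper
```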