---
title: "Fast Algorithms for Quantile Regression with Selection: A Vignette"
author: "Santiago Pereda Fernández"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fast Algorithms for Quantile Regression with Selection: A Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
biblio-style: apalike
link-citations: true
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


#Introduction

In this document I illustrate how one can use the package \texttt{fastqrs} to estimate the quantile selection model considered by \cite{Arellano2017} using the algorithms proposed by \cite{Pereda2024a} to speed up the computation. The results from this package are the estimates of Quantile Regression with Selection (QRS, \citep{Arellano2017}), an estimator that generalizes both Quantile Regression(QR, \citep{Koenker1978}) and Heckman's selection estimator \citep{Heckman1979}.

#Setup

First, make sure that the \texttt{fastqrs} package is and loaded correctly.

```{r setup, include=FALSE}
library(fastqrs)
```

This package contains several functions, many of which are intermediate functions called by the main functions that can are directly used by the user. Namely, the functions \texttt{qrs.fast} and \texttt{qrs.fast.bt} respectively compute the QRS estimator and a weighted bootstrap of it that can be used for inference.


#Using fastqrs

The main function to use is \texttt{qrs.fast}. On top of the package \texttt{fastqrs}, it also uses the packages \texttt{quantreg}, \texttt{copula}, and \texttt{sampleSelection}. The former is used because the algorithms described in \cite{Pereda2024a} are a modification of the linear programming problem of standard quantile regression. \texttt{copula} is used to compute the conditional copulas that rotate the so-called check function. Lastly, \texttt{sampleSelection} is used to compute the propensity score, which is the first step of the estimation of QRS.

```{r load-libraries, echo=TRUE}
# Load the package
library("sampleSelection")
library("quantreg")
library("fastqrs")
library("copula")
library("ggplot2")
```

To illustrate the estimator I use the Mroz data set, consisting of 753 observations from the U.S. Women's Labor Force Participation. This data set includes a number of individual variables that can be used to define a sample selection model. To keep the analysis contained and clear, the specification will follow the same used by \cite{vigss} in the \texttt{sampleSelecion} vignette.

The dependent variable, y, is the worker's average hourly earnings. This variable is observed for the subsample of women who are employed, denoted by d. For the remaining observations, the wage is set to 0. The wage is modeled as a quantile linear process using the set of covartiates x, which includes the a quadratic polynomial of work experience, the number of years of education and a dummy for those who were residents in an urban area. In contrast, the being employed is modeled as a function of a quadratic polynomial of age, family income, the number of kids, and the number of years of education. Lastly, each observation is given the same weight, denoted by w.

```{r load-data, echo=TRUE}
data("Mroz87")
Mroz87$kids <- ( Mroz87$kids5 + Mroz87$kids618 > 0 )

y <- Mroz87$wage
d <- Mroz87$lfp
x <- cbind(Mroz87$exper, Mroz87$exper^2,Mroz87$educ,
           Mroz87$city)
z <- cbind(Mroz87$age, Mroz87$age^2, Mroz87$faminc,
           Mroz87$kids, Mroz87$educ)
w <- rep(1,NROW(y))
```

The \texttt{qrs.fast} function requires some input parameters to work. First, it is necessary to specify how the propensity score is computed and which parametric copula is used in the estimation. For the former, the program uses the \texttt{glm} function, which includes two of the most popular binary choice estimators: probit and logit. As for the copula, one can choose from 4 different 1-parameter copulas: Gaussian, Clayton, Frank, and Gumbel. In this example, we use a probit to compute the propensity score, and a Gaussian copula. Moreover, it is necessary to defined the grid of values of the copula parameter where the criterion function to obtain its estimate is evaluated. Since the parameter for the Gaussian copula lies inside the $\left[-1,1\right]$ interval, the grid used in this example takes a number of evenly spaced values of this interval, denoted by gridtheta.

In addition, there are some parameters that are linked to the estimation algorithm. Q1 and Q2 are the number of quantiles used at different stages of the algorithms. Q1 is the reduced quantile grid that is used to speed up the estimation. Q2 is the large quantile grid, and the resulting estimates are the coefficients at these quantiles. In this example, they are equal to 9 and 19, respectively. P is the number of candidate values of the copula parameter that are chosen when the criterion function is approximated by the reduced quantile grid, and in this example it is equal to 10. Lastly, m is used to determine the width of the intervals used for preprocessing, which is set to 1.

```{r set-parameters, echo=TRUE}
link <- "probit"
family <- "Gaussian"
gridtheta <- seq(from = -1, to = 1, by = .05)
Q1 <- 9
Q2 <- 19
P <- 10
m <- 1
```

Using all the previously defined variables and parameters, the \texttt{qrs.fast} function is run to yield several results. First, it yields the coefficients from the binary choice model, which in this example are the probit coefficients. Second, it yields the quantile coefficients, which are computed for the large quantile grid. These are the estimates corrected for sample selection. Third, it returns the value of the estimated copula parameter, which is informative about the amount and direction of selection in the sample. Fourth, it is m, the value of the criterion function evaluated at the estimated quantile and copula coefficients. Lastly, it returns an array of preliminary estimates of the quantile coefficients evaluated at all the values of the copula parameter grid and the reduced quantile grid. This array is important as an initial value to speed up the estimates of the bootstrap.

```{r estimation, echo=TRUE}
qrs <- qrs.fast(y, x, d, z, w, Q1, Q2, P, link, family,
                gridtheta, m)
```

In this example, the copula coefficient equals 0.7, which denotes a high level of negative sample selection. In words, women with a higher probability of being employed were those with a lower wage if employed. This number is much larger than the estimated correlation coefficient found in \cite{Greene2003} with either the 2-stage Heckman estimator or the MLE version of it.

The beta coefficients are stored in \texttt{qrs}\$\texttt{beta}. For each of the 5 covariates (including the intercept, which is the first coefficient) there are 19 coefficients, one for each of the vingtiles used in the estimation. The following graph shows the estimated quantile process for the last covariate, urban. For the first half of the distribution, the estimated coefficient is very close to zero, indicating that for about a half of the female workers, wages are similar for those in urban or rural areas. However, the right tail is increasing, showing an urban premium that affects those that would rank higher in the conditional distribution of potential outcomes.

```{r graph1, echo=TRUE, fig=TRUE, fig.width=6, fig.height=4}
data_for_plot <- data.frame(
   Quantile = seq(1/(Q2+1), Q2/(Q2+1), length.out = Q2),
   Coefficient = as.numeric(qrs$beta[5, ]))
ggplot(data_for_plot, aes(x = Quantile, y = Coefficient)) +
   geom_line(color = "blue", linewidth = 1) +
   geom_point(color = "red", size = 2) +
   labs(
       title = "QRS Coefficients for Urban",
       x = "Quantile",
       y = "Estimated Coefficient"
   )
```

To conduct inference, one can use the weighted bootstrap, as shown in \cite{Pereda2024b}. This bootstrap, rather than sampling at random with repetition from the observations available, it draws positive weights for each observation. Then, the estimates are obtained using these weights in the estimation. The \texttt{qrs.fast.bt} function does this, computing the bootstrap for all of the estimated parameters. To make it work, on top of the parameters used for the \texttt{qrs.fast} function, one needs to specify the number of repetitions, reps, the significance level(s), alpha, and use the estimates with the reduced quantile grid to speed up the estimation. The latter is a result obtained in the estimation that can be accessed at \texttt{qrs}\$\texttt{b}.

```{r bootstrap, echo=TRUE}
if (interactive()) {
  set.seed(1)
  reps<-200
  alpha <- c(0.1,0.05,0.01)
  qrs.bt <- qrs.fast.bt(y, x, d, z, w, Q1, Q2, P, link, family,
                        gridtheta, m, qrs$b1, reps, alpha)
} else {
  #bootstrap_results <- load(system.file("extdata", "bootstrap_results.RData", package="fastqrs"))
  bootstrap_file <- system.file("extdata", "bootstrap_results.RData", package="fastqrs")
  if (bootstrap_file != "") {
    load(bootstrap_file)
  } else {
    stop("Bootstrap results file not found. Make sure the package is correctly installed.")
  }
}
```

The bootstrap results show that the standard error for the copula parameter is large, making it not significantly different from zero. This result is not unexpected, as it was also the case for the Heckman 2-stage estimator and the MLE, as shown in \cite{Greene2003}. If we look at the quantile coefficients, we can see that these are not significant either. The following graph shows the estimate of the variable urban, along with the 90\% confidence bands. For almost all considered quantiles, 0 lies inside these bands, making these effects not significant. Again, this is not unexpected, as the urban coefficient estimated by \cite{Greene2003} was also not significant. Therefore, the estimates provide some evidence of heterogeneous effects, but this evidence is weak, as it is not possible to reject the null hypothesis of no effect.

```{r graph2, echo=TRUE, fig=TRUE, fig.width=6, fig.height=4}
data_for_plot <- data.frame(
   Quantile = seq(1/(Q2+1), Q2/(Q2+1), length.out = Q2),
   Coefficient = as.numeric(qrs$beta[5, ]),
   Upper = as.numeric(qrs.bt$betaub[5, ,1]),
   Lower = as.numeric(qrs.bt$betalb[5, ,1])
)
ggplot(data_for_plot, aes(x = Quantile)) +
   # Line for coefficients
   geom_line(aes(y = Coefficient), color = "blue", linewidth = 1) +
   geom_point(aes(y = Coefficient), color = "blue", size = 2) +
   # Lines for confidence bands
   geom_line(aes(y = Upper), color = "red",
             linetype = "dashed", linewidth = 0.8) +
   geom_line(aes(y = Lower), color = "red",
             linetype = "dashed", linewidth = 0.8) +
   # Add labels and title
   labs(
   title = "QRS Coefficients for Urban with 90% Confidence Bands",
   x = "Quantile",
   y = "Estimated Coefficient"
   )
```