--- title: The Art of Influence subtitle: A Practical Guide to Working with Influence Functions author: Klaus Kähler Holst Rdate: "`r Sys.Date()`" output: # html_document: rmarkdown::html_vignette: # toc_float: true fig_caption: true toc: true toc_depth: 2 mathjax: "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-AMS_CHTML.js" vignette: > %\VignetteIndexEntry{The Art of Influence} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: ref.bib --- ```{r setup,include=FALSE } library("lava") knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) has_mets <- lava:::versioncheck('mets', 1) has_geepack <- lava:::versioncheck('geepack', 1) ``` \newcommand{\arctanh}{\operatorname{arctanh}} \newcommand{\indep}{\!\perp\!\!\!\!\perp\!} \newcommand{\pr}{\mathbb{P}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathbb{V}\!\text{ar}} \newcommand{\cov}{\mathbb{C}\!\text{ov}} \newcommand{\cor}{\mathbb{C}\!\text{or}} \newcommand{\IC}{\operatorname{IC}} \newcommand{\one}{\mathbf{1}} \newcommand{\Dto}{\overset{\mathcal{D}}{\longrightarrow}} \newcommand{\bm}[1]{\mathbf{#1}} \newcommand{\Pz}{P_{0}} \newcommand{\op}{o_{P}} # Influence functions Estimators that have parametric convergence rates can often be fully characterized by their *influence function* (IF), also referred to as an influence curve or canonical gradient [@bickel_effic_adapt_estim_semip_model; @vaart_1998_asymp]. The IF allows for the direct estimation of properties of the estimator, including its asymptotic variance. Moreover, estimates of the IF enable the simple combination and transformation of estimators into new ones. This vignette describes how to estimate and manipulate IFs using the R-package `lava` [@holst_budtzjorgensen_2013]. Formally, let \(Z_1,\ldots,Z_n\) be iid \(k\)-dimensional stochastic variables, \(Z_i=(Y_{i},A_{i},W_{i})\sim \Pz\), and \(\widehat{\theta}\) a consistent estimator for the parameter \(\theta\in\mathbb{R}^p\). When \(\widehat{\theta}\) is a *regular and asymptotic linear* (RAL) estimator, it has a unique iid decomposition \begin{align*} \sqrt{n}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \IC(Z_i; \Pz) + \op(1), \end{align*} where the function \(\IC\) is the unique *Influence Function* s.t. \(\mathbb{E}\{\IC(Z_{i}; \Pz)\}=0\) and \(\var\{\IC(Z_{i}; \Pz)^{2}\}<\infty\) [@tsiatis2006semiparametric; @vaart_1998_asymp]. The influence function thus fully characterizes the asymptotic behaviour of the estimator and by the central limit theorem it follows that the estimator converges weakly to a Gaussian distribution \[ \sqrt{n}(\widehat{\theta}-\theta) \Dto \mathcal{N}(0, \var\{\IC(Z; \Pz)\}), \] where the empirical variance of the plugin estimator, \(\pr_{n}\IC(Z; \widehat{P})^{\otimes 2} = \frac{1}{n}\sum_{i=1}^n \IC(Z_{i}; \widehat{P})\IC(Z_{i}; \widehat{P})^{\top}\) can be used to obtain a consistent estimate of the asymptotic variance. Note, in practice the estimate \(\widehat{P}\) used in the plugin-estimate, needs only to capture the parts of the distribution of \(Z\) that is necessary to evaluate the IF. In some cases this nuisance parameter can be estimated using flexible machine learning components and in other (parametric) cases be derived directly from \(\widehat{\theta}\). The IFs are easily derived for the parameters of many parametric statistical models as illustrated in the [next example sections](#example-generalized-linear-model). More generally, the IF can also be derived for a smooth target parameter \(\Psi: \mathcal{P}\to\mathbb{R}\) where \(\mathcal{P}\) is a family of probability distributions forming the statistical model, which often can be left completely non-parametric. Formally, the parameter must be *pathwise differentiable* see [@vaart_1998_asymp] in the sense that there exists linear bounded function \(\dot\Psi\colon L_{2}(P_{0})\to\mathbb{R}\) such that \( [\Psi(P_{t}) - \Psi(P_{0}))]t^{-1} \to \dot\Psi(P_{0})(g) \) as \(t\to 0\) for any parametric submodel \(P_t\) with score model \(g(z)= \partial/(\partial t) \log (p_t)(z)|_{t=0}\). Riesz's representation theorem then tells us that the directional derivative has a unique representer, \(\phi_{P_{0}}\) lying in the closure of the submodel score space (the *tangent space*), s.t. \begin{align*} \dot\Psi(P_0)(g) = \langle\phi_{P_0}, g\rangle = \int \phi_{P_0}(Z)g(X)\,dP_0 \end{align*} The unique representer is exactly the IF which can be found by solving the above integral equation. For more details on how to derive influence functions, we refer to [@targetedlearning_2011; @hines2022]. As an example we might be interested in the target parameter \(\Psi(P) = \E_P(Z)\) which can be shown to have the unique (and thereby efficient) influence function \(Z\mapsto Z-\E_P(Z)\) under the non-parametric model. Another target parameter could be \(\Psi_{a}(P) = \E_{P}[\E_{P}(Y\mid A=a, W)]\) which is often a key interest in causal inference and which has the IF \begin{align*} \IC(Y,A,W; P) = \frac{\one(A=a)}{\pr(A=a\mid W)}(Y-\E_{P}[Y\mid A=a,W]) + \E_{P}[Y\mid A=a,W] - \Psi_{a}(P) \end{align*} See section on [average treatment effects](#average-treatment-effects). # Examples To illustrate the methods we consider data arising from the model \(Y_{ij} \sim Bernoulli\{\operatorname{expit}(X_{ij} + A_{i} + W_{i})\}, A_{i} \sim Bernoulli\{\operatorname{expit}(W_{i})\}\) with independent covariates \(X_{ij}\sim\mathcal{N}(0,1), W_{i}\sim\mathcal{N}(0,1)\). ```{r sim_model} m <- lvm() |> regression(y1 ~ x1 + a + w) |> regression(y2 ~ x2 + a + w) |> regression(y3 ~ x3 + a + w) |> regression(y4 ~ x4 + a + w) |> regression(a ~ w) |> distribution(~ y1 + y2 + y3 + y4 + a, value = binomial.lvm()) |> distribution(~id, value = Sequence.lvm(integer = TRUE)) ``` We simulate from the model where \(Y_3\) is only observed for half of the subjects ```{r simulate} n <- 4e2 dw <- sim(m, n, seed = 1) |> transform(y3 = y3 * ifelse(id > n / 2, NA, 1)) Print(dw) ## Data in long format dl <- reshape(dw, varying = list(paste0("y",1:4), paste0("x",1:4)), v.names=c("y", "x"), direction="long") |> na.omit() dl <- dl[order(dl$id), ] ## dl <- mets::fast.reshape(dw, varying = c("y", "x")) |> na.omit() Print(dl) ``` ## Example: population mean The main functions for working with influence functions are - `estimate` which prepares a model object and estimates the IF and corresponding robust standard errors. Can also be used to transform model parameters by application of the Delta Theorem. - `merge`, `+` method for combining estimates via their estimated IFs - `IC` method to extract the estimated IF The `estimate` function is the primary tool for obtaining parameter estimates and related information. It returns an object of the class type `estimate`, which is a general container for holding information about estimated parameters. The estimate function takes as input either a model object (the first argument `x`), or a parameter vector and corresponding influence function (IF) matrix specified using the `coef` and `IF` arguments. If the primary goal is to apply the delta method or test linear hypotheses, it is also possible to provide the asymptotic variance estimate via the `vcov` argument, without specifying the IF matrix. ```{r estimate.syntax, eval=FALSE} estimate(x=, ...) estimate(coef=, IF=, ...) estimate(coef=, vcov=, ...) ``` Here we first consider the problem of estimating the IF of the mean. For a general transformation \(f: \mathbb{R}^k\to\mathbb{R}^p\) we have that \[ \sqrt{n}\{\mathbb{P}_{n}f(X) - \E[f(X)]\} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} f(X_{i}) - \E[f(X)] \] and hence for the problem of estimating the proportion of the binary outcome \(Y_1\), the IF is given by \(\one(Y_{1}=1) - \pr(Y_{1}=1)\). To estimate this parameter and its IF we will use the `estimate` function ```{r inp1} inp <- as.matrix(dw[, c("y1", "y2")]) e <- estimate(inp[, 1, drop = FALSE], type="mean") class(e) e ``` The reported standard errors from the `estimate` method are the robust standard errors obtained from the IF. The variance estimate and the parameters can be extracted with the `vcov` and `coef` methods. The IF itself can be extracted with the `IC` (or `influence`) method: ```{r ic1} IC(e) |> Print() ``` It is also possible to simultaneously estimate the proportions of each of the two binary outcomes ```{r inp2} estimate(inp) ``` or alternatively the input can be a model object, here a `mlm` object: ```{r mlm} e <- lm(cbind(y1, y2) ~ 1, data = dw) |> estimate() IC(e) |> head() ``` Different methods are available for inspecting an `estimate` object ```{r estimatemethods} summary(e) ## extract parameter coefficients coef(e) ## ## Asymptotic (robust) variance estimate vcov(e) ## Matrix with estimates and confidence limits estimate(e, level = 0.99) |> parameter() ## Influence curve IC(e) |> head() ## Join estimates e + e # Same as merge(e,e) ``` ## Example: generalized linear model For a \(Z\)-estimator defined by the score equation \(E[U(Z; \theta)] = 0\), the IF is given by \begin{align*} IC(Z; \theta) = \E\Big\{\frac{\partial}{\partial\theta^\top}U(\theta; Z)\Big\}^{-1}U(Z; \theta) \end{align*} In particular, for a maximum likelihood estimator the score, \(U\), is given by the partial derivative of the log-likelihood function. As an example, we can obtain the estimates with robust standard errors for a logistic regression model: ```{r glm} g <- glm(y1 ~ a + x1, data = dw, family = binomial) estimate(g) ``` We can compare that to the usual (non-robust) standard errors: ```{r glm.std} estimate(g, robust = FALSE) ``` The IF can be extracted from the `estimate` object or directly from the model object ```{r ifglm} IC(g) |> head() ``` The same estimates can be obtained with a *cumulative link regression* model which also generalizes to ordinal outcomes. Here we consider the proportional odds model given by \begin{align*} \log\left(\frac{\pr(Y\leq j\mid x)}{1-\pr(Y\leq j\mid x)}\right) = \alpha_{j} + \beta^{t}x, \quad j=1,\ldots,J \end{align*} ```{r ordreg} ordreg(y1 ~ a + x1, dw, family=binomial(logit)) |> estimate() ``` Note that the `sandwich::estfun` function from the `sandwich` library [@r_sandwich] can also estimate the IF for different parametric models, but does not provide the tools for combining and transforming these. ## Example: right-censored outcomess To illustrate the methods on survival data we will use the Mayo Clinic Primary Biliary Cholangitis Data [@therneau00surv] ```{r mets} library("survival") data(pbc, package="survival") ``` The Cox proportional hazards model can be fitted with the `mets::phreg` method which can estimate the IF for both the partial likelihood parameters and the baseline hazard. Here we fit a survival model with right-censored event times ```{r phreg, eval=has_mets } fit.phreg <- mets::phreg(Surv(time, status > 0) ~ age + sex, data = pbc) fit.phreg IC(fit.phreg) |> head() ``` The IF for the baseline cumulative hazard at a specific time point \begin{align*} \Lambda_0(t) = \int_0^t \lambda_0(u)\,du, \end{align*} where \(\lambda_0(t)\) is the baseline hazard, can be estimated in similar way: ```{r phreg-baseline, eval=has_mets } baseline <- function(object, time, ...) { ic <- mets::IC(object, baseline = TRUE, time = time, ...) est <- mets::predictCumhaz(object$cumhaz, new.time = time)[1, 2] estimate(NULL, coef = est, IC = ic, labels = paste0("chaz:", time)) } tt <- 2000 baseline(fit.phreg, tt) ``` The `estimate` and `IF` methods are also available for parametric survival models via `survival::survreg`, here a Weibull model: ```{r survreg, eval=has_mets } survival::survreg(Surv(time, status > 0) ~ age + sex, data = pbc, dist="weibull") |> estimate() ``` ## Example: random effects model / structural equation model General structural equation models (SEMs) can be estimated with `lava::lvm`. Here we fit a random effects probit model \[ \pr(Y_{ij} = 1 \mid U_{i}, W_{ij})=\Phi(\mu_{j} + \beta_{j} W_{ij} + U_{i}), \quad U_{i}\sim\mathcal{N}(0,\sigma_{u}^{2}),\quad j=1,2 \] to the simulated dataset ```{r semfit, eval=has_mets } sem <- lvm(y1 + y2 ~ 1 * u + w) |> latent(~ u) |> ordinal(K=2, ~ y1 + y2) semfit <- estimate(sem, data = dw) ## Robust standard errors estimate(semfit) ``` ## Example: quantile Let \(\beta\) denote the \(\tau\)th quantile of \(X\), with IF \begin{align*} \IC(x; \Pz) = \tau - \one(x\leq \beta)f_{0}(\beta)^{-1} \end{align*} where \(f_{0}\) is the density function of \(X\). To calculate the variance estimate, an estimate of the density is thus needed which can be obtained by a kernel estimate. Alternatively, the resampling method of [@zenglin2008] can be applied. Here we use a kernel smoother (additional arguments to the `estimate` function are parsed on to `stats::density.default`) to estimate the quantiles and IF for the 25%, 50%, and 75% quantiles of \(W\) and \(X_1\) ```{r quantiles} eq <- estimate(dw[, c("w", "x1")], type = "quantile", probs = c(0.25, 0.5, 0.75)) eq IC(eq) |> head() ``` # Combining influence functions A key benefit of working with the IFs of estimators is that this allows for transforming or combining different estimates while easily deriving the resulting IF and thereby asymptotic distribution of the new estimator. Let \(\widehat{\theta}_{1}, \ldots, \widehat{\theta}_{M}\) be \(M\) different estimators with decompositions \begin{align*} \sqrt{n}(\widehat{\theta}_{m}-\theta_{m}) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \IC_m(Z_i; \Pz) + \op(1) \end{align*} based on iid data \(Z_1,\ldots,Z_n\). It then follows immediately [@vaart_1998_asymp Theorem 18.10[vi]] that the joint distribution of \(\widehat{\theta} - {\theta}= (\widehat{\theta}_{1}^{\top},\ldots,\widehat{\theta}_{M}^{\top})^\top- ({\theta}_{1}^{\top},\ldots,{\theta}_{M}^{\top})^\top \) is given by \begin{align*} \sqrt{n}(\widehat{\theta}-\theta) &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n} \underbrace{[\IC_{1}(Z_i; \Pz)^\top,\ldots,\IC_{M}(Z_i; \Pz)^\top]^{\top}}_{\overline{\IC}(Z_i; \Pz)} + \op(1) \\ &\Dto \mathcal{N}(0,\Sigma) \end{align*} by the CLT, and under regulatory conditions \(\mathbb{P}_{n}\overline{\IC}(Z_i; \widehat{P})^{\otimes 2} \overset{P}{\longrightarrow}\Sigma\) as \(n\to\infty\). To illustrate this we consider two marginal logistic regression models fitted separately for \(Y_1\) and \(Y_2\) and combine the estimates and IFs using the `merge` method ```{r glmmarg} g1 <- glm(y1 ~ a, family=binomial, data=dw) g2 <- glm(y2 ~ a, family=binomial, data=dw) e <- merge(g1, g2) summary(e) ``` As we have access to the joint asymptotic distribution we can for example test for whether the odds-ratio is the same for the two responses: ```{r hypo1} estimate(e, cbind(0,1,0,-1), null=0) ``` More details an be found in the Section on [hypothesis testing](#linear-contrasts-and-hypothesis-testing). ## Imbalanced data Let \(O_{1} = (Z_{1}R_{1}, R_{1}), \ldots, O_{N}=(Z_{N}R_{N}, R_{N})\) be iid with \(R_{i}\indep Z_i\) and let the full-data IF for some estimator of a parameter \(\theta\in\mathbb{R}^p\) be \(IC(\cdot; \Pz)\). For convenience let the data be ordered \(R_{i}=\one(i\leq n)\) where \(n\) is the number of observed data points, then the complete-case estimator is consistent and based on same IF \begin{align*} \sqrt{n}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n IC(Z_i; \Pz) + \op(1). \end{align*} This estimator can also be decomposed in terms of the observed data \(O_1,\ldots,O_N\) noting that \begin{align*} \sqrt{N}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{N}}\sum_{i=1}^N IC(Z_i; P)\frac{R_i N}{n} + \op(1). \end{align*} where the term \(\frac{R_i N}{n}\) corresponds to an inverse probability weighting with the empirical plugin estimate of the proportion of observed data \(R=1\). Under a missing completely at random assumption we can therefore combine estimators that are estimated on different datasets. Let the observed data be \((Z_{11}R_{11}, R_{11}, Z_{21}R_{21}, R_{21}), \ldots, (Z_{1N}R_{1N}, R_{1N}, Z_{2N}R_{2N}, R_{2N}))\) with complete-case estimators \(\widehat{\theta}_1\) and \(\widehat{\theta}_2\) for parameters \(\theta_1\) and \(\theta_2\) based on \((Z_{11}R_{11}, \ldots, Z_{1N}R_{1N})\) and \((Z_{21}R_{21}, \ldots, Z_{2N}R_{2N})\), respectively, and let the corresponding IFs be \(IC_{1}(\cdot; \Pz)\) and \(IC_{2}(\cdot;\ P)\). It then follows that \begin{align*} \sqrt{N}\left\{ \begin{pmatrix} \widehat{\theta}_1 \\ \widehat{\theta}_2 \end{pmatrix} - \begin{pmatrix} \vphantom{\widehat{\theta}_1}\theta_1 \\ \vphantom{\widehat{\theta}_1}\theta_2 \end{pmatrix} \right\} = \frac{1}{\sqrt{N}}\sum_{i=1}^N \begin{pmatrix} IC_1(Z_{1i}; \Pz)\frac{R_{1i}N}{R_{1\bullet}} \\ IC_2(Z_{2i}; \Pz)\frac{R_{2i}N}{R_{2\bullet}} \end{pmatrix} + \op(1) \end{align*} with \(R_{k\bullet} = \sum_{i=1}^{N}R_{ki}.\) Returning to the example, we can combine the marginal estimates of two model objects that have been estimated from different datasets (as the outcome \(Y_3\) is only available in half of the data). Here we will use the overloaded `+` operator ```{r glmmargmis} g2 <- glm(y2 ~ 1, family = binomial, data = dw) summary(g2) dwc <- na.omit(dw) g3 <- glm(y3 ~ 1, family = binomial, data = dwc) summary(g3) e2 <- estimate(g2, id = dw$id) e3 <- estimate(g3, id = "id", data=dwc) merge(e2,e3) |> IC() |> Print() vcov(e2 + e3) ## Same marginals as list(vcov(e2), vcov(e3)) ``` Note, it is also possible to directly specify the id-variables in the `merge` call: ```{r merge} merge(e2, e3, id = list(dw$id, dwc$id)) ``` In the above example the `id` argument defines the identifier that makes it possible to link the rows in the different IFs that should be glued together. If omitted then the `id` will automatically be extracted from the model-specific `IC` method (deriving it from the original data.frame used for estimating the model). This automatically works with all models and `IC` methods described in this document. ```{r estimatenoid} estimate(g2) |> IC() |> head() vcov(estimate(g2) + estimate(g3)) (estimate(g2) + estimate(g3)) |> (rownames %++% head %++% IC)() ``` To force that the id variables are not overlapping between the merged model objects, i.e., assuming that there is complete independence between the estimates, the argument `id=NULL` can be used ```{r merge_idnull} merge(g1, g2, id = NULL) |> (Print %++% IC)() merge(g1, g2, id = NULL) |> vcov() ``` ## Renaming and subsetting parameters To only keep a subset of the parameters the `keep` argument can be used. ```{r mergekeep} merge(g1, g2, keep = c("(Intercept)", "(Intercept).1")) ``` The argument can be given either as character vector or a vector of indices: ```{r merge2} merge(g1,g2, keep=c(1, 3)) ``` or as a vector of perl-style regular expressions ```{r merge3} merge(g1, g2, keep = "cept", regex = TRUE) merge(g1, g2, keep = c("\\)$", "^a$"), regex = TRUE, ignore.case = TRUE) ``` When merging estimates unique parameter names are created. It is also possible to rename the parameters with the `labels` argument ```{r merge4} merge(g1, g2, labels = c("a", "b", "c")) |> estimate(keep = c("a", "c")) merge(g1, g2, labels = c("a", "b", "c"), keep = c("a", "c") ) estimate(g1, labels=c("a", "b")) ``` Finally, the `subset` argument can be used to subset the parameters and IFs before the actual merging is being done ```{r merge5} merge(g1, g2, subset="(Intercept)") ``` ## Clustered data (non-iid case) Let \(Z_i = (Z_{i1},\ldots,Z_{iN_{i}})\) and assume that \((Z_{i}, N_{i}) \sim P\), \(i=1,\ldots,n\) are iid and \(N_i\indep Z_{ij}\). The variables \(Z_{i1},\ldots,Z_{iN_{i}}\) we assume are exchangeable but not necessarily independent. Define \(N = \sum_{i=1}^{n} N_i\), and assume that a parameter estimate, \(\widehat{\theta}\in\mathbb{R}^p\) has the decomposition \[ \sqrt{N}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{N}} \sum_{i=1}^{n} \sum_{k=1}^{N_{i}} IC(Z_{ik}; \Pz) + \op(1). \] It then follows that \[ \sqrt{n}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{\IC}(Z_{i}; \Pz) + \op(1) \] with \(\widetilde{\IC}(Z_{i}; \Pz) = \sum_{k=1}^{N_{i}} \frac{n}{N}IC(Z_{ik}; \Pz)\), \(i=1,\ldots,n\) which are iid an therefore admits the usual CLT to derive the asymptotic variance of \(\widehat{\theta}\). Turning back to the example data, we can estimate the marginal model ```{r cluster1} g0 <- glm(y ~ a + w + x, data = dl, family = binomial()) ``` The asymptotic variance estimate ignoring that the observations are not independent is not consistent. Instead we can calculate the cluster robust standard errors from the above iid decomposition ```{r cluster2} estimate(g0, id=dl$id) ``` We can confirm that this situation is equivalent to the variance estimates we obtain from a GEE marginal model with working independence correlation structure [@r_geepack] ```{r geepack, eval=has_geepack } gee0 <- geepack::geeglm(y ~ a + w + x, data = dl, id = dl$id, family=binomial) summary(gee0) ``` ## Computational aspects Working with large and potentially multiple different IFs can be memory-intensive. A remedy is to use the idea of aggregating the IFs by introducing a random coarser grouping variable. Following the same arguments as in the previous section, the aggregated IF will still be iid and allows us to estimate the asymptotic variance. Obviously, the same grouping must be used across estimates when combining IFs. ```{r aggregate} set.seed(1) y <- cbind(rnorm(1e5)) N <- 2e2 ## Number of aggregated groups, the number of observations in the new IF id <- foldr(nrow(y), N, list=FALSE) Print(cbind(table(id))) ## Aggregated IF e <- estimate(cbind(y), id = id) object.size(e) e ``` # IF building blocks: transformations and the delta theorem Let \(\phi\colon \mathbb{R}^p\to\mathbb{R}^m\) be differentiable at \(\theta\) and assume that \(\widehat{\theta}_n\) is RAL estimator with IF given by \(\IC(\cdot; \Pz)\) such that \begin{align*} \sqrt{n}(\widehat{\theta}_n - \theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \IC(Z_i; \Pz) + \op(1), \end{align*} then by the delta method [@vaart_1998_asymp Theorem 3.1] \begin{align*} \sqrt{n}\{\phi(\widehat{\theta}_n) - \phi(\theta)\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \nabla\phi(\theta)\IC(Z_i; \Pz) + \op(1), \end{align*} where \(\phi\colon \theta\mapsto (\phi_{1}(\theta),\ldots,\phi_{m}(\theta))^\top\) and \(\nabla\) is the partial derivative operator \begin{align*} \nabla\phi(\theta) = \begin{pmatrix} \tfrac{\partial}{\partial\theta_1}\phi_1(\theta) & \cdots & \tfrac{\partial}{\partial\theta_p}\phi_1(\theta) \\ \vdots & \ddots & \vdots \\ \tfrac{\partial}{\partial\theta_1}\phi_m(\theta) & \cdots & \tfrac{\partial}{\partial\theta_p}\phi_m(\theta) \\ \end{pmatrix}. \end{align*} Together with the ability to derive the joint IF from marginal IFs, this provides us with a powerful tool for constructing new estimates using the IFs as the fundamental building blocks. To apply the delta method the transformation of the parameters function must be supplied to the `estimate` method (argument `f`) ```{r delta1} estimate(g1, sum) estimate(g1, function(p) list(a = sum(p))) # named list ## Multiple parameters estimate(g1, function(x) c(x, x[1] + exp(x[2]), inv = 1 / x[2])) estimate(g1, exp) ``` The gradient can be provided as the attribute `grad` and otherwise numerical differentiation is applied. ## Example: Pearson correlation As a simple toy example consider the problem of estimating the covariance of two variables \(X_1\) and \(X_2\) \begin{align*} \widehat{\cov}(X_1,X_2) = \mathbb{P}_n(X_1-\mathbb{P}_n X_1)(Y_1-\mathbb{P}_n Y_1). \end{align*} It is easily verified that the IF of the sample estimate of \((\E X_{1}, \E X_{2}, \E\{X_{1}X_{2}\})^\top\) given by is \(\IC(X1,X2; \Pz) = (X_{1}-\E X_{1}, X_{2}-\E X_{2}, X_{1}X_{2}-\E\{X_{1}X_{2}\})^\top\). By the delta theorem with \(\phi(x,y,z) = z-xy\) we have \(\nabla\phi(x,y,z) = (-y -x 1)\) and thus the IF for the sample covariance estimate becomes \begin{align*} \IC_{x_1, x_2}(X_1, X_2; \Pz) = (X_1 - \E X_1)(X_2 - \E X_2) - \cov(X_1,X_2) \end{align*} We can implement this directly using the `estimate` function via the `IC` argument which allows us to provide a user-specificed IF and with the point estimate given by the `coef` argument ```{r cov} Cov <- function(x, y, ...) { est <- mean(x * y)-mean(x)*mean(y) estimate( coef = est, IC = (x - mean(x)) * (y - mean(y)) - est, ... ) } with(dw, Cov(x1, x2)) ``` As an illustration we could also derive this estimate from simpler building blocks of \(\E X_{1}\), \(\E X_{2}\), and \(\E(X_{1}X_{2})\). ```{r cov2} e1 <- lm(cbind(x1, x2, x1 * x2) ~ 1, data = dw) |> estimate(labels = c("Ex1", "Ex2", "Ex1x2")) e1 estimate(e1, function(x) c(x, cov=with(as.list(x), Ex1x2 - Ex2* Ex1))) ``` The variance estimates can be estimated in the same way and the combined estimates be used to estimate the correlation ```{r rho} e2 <- with(dw, Cov(x1, x2, labels = "c", id = id) + Cov(x1, x1, labels = "v1", id = id) + Cov(x2, x2, labels = "v2", id = id)) rho <- estimate(e2, function(x) list(rho = x[1] / (x[2] * x[3])^.5)) rho ``` by using a variance stabilizing transformation, Fishers \(z\)-transform [@lehmann2023_testing], \(z = \arctanh(\widehat{\rho}) = \frac{1}{2}\log\left(\frac{1+\widehat{\rho}}{1-\widehat{\rho}}\right)\), confidence limits with general better coverage can be obtained ```{r tanh} estimate(rho, atanh, back.transform = tanh) ``` The confidence limits are calculated on the \(\arctanh\)-scale and transformed back to the original correlation scale via the `back.transform` argument. In this case, where the estimates are far away from the boundary of the parameter space, the variance stabilizing transform does almost not have any impact, and the confidence limits agrees with the original symmetric confidence limits. ## TODO Example: survival ## Linear contrasts and hypothesis testing An important special case of parameter transformations are linear transformations. A particular interest may be formulated around testing null-hypotheses of the form \begin{align*} H_0\colon\quad \bm{B}\theta = \bm{b}_0 \end{align*} where \(\bm{B}\in\mathbb{R}^{m\times p}\) is a matrix of estimable contrasts and \(\bm{b}_0\in\mathbb{R}^{m}\). As an example consider marginal models for the binary response variables \(Y_1, Y_2, Y_3, Y_4\) ```{r} g <- lapply( list(y1 ~ a, y2 ~ a, y3 ~ a), #, y4 ~ a+x4), function(f) glm(f, family = binomial, data = dw) ) gg <- Reduce(merge, g) gg ``` A linear transformation can be specified via the `f` as a matrix argument instead of function object ```{r contrast1} B <- cbind(0,1, 0,-1, 0,0) estimate(gg, B) ``` The \(\bm{b}_0\) vector (default assumed to be zero) can be specified via the `null` argument ```{r estcontrast1} estimate(gg, B, null=1) ``` For testing multiple hypotheses we use that \((\bm{B}\widehat{\theta}-\bm{b}_0)^{\top} (\bm{B}\widehat{\Sigma}\bm{B}^{\top})^{-1} (\bm{B}\widehat{\theta}-\bm{b}_0) \sim \chi^2_{\operatorname{rank}(B)}\) under the null hypothesis where \(\widehat{\Sigma}\) is the estimated variance of \(\theta\) (i.e., `vcov(gg)`) ```{r estcontrast2} B <- rbind(cbind(0,1, 0,-1, 0,0), cbind(0,1, 0,0, 0,-1)) estimate(gg, B) ``` Such linear statistics can also be specified directly as expressions of the parameter names ```{r} estimate(gg, a + a.1, 2*a - a.2, a, null=c(2,1,1)) ``` We refer to the function `lava::contr` and `lava::parsedesigns` for defining contrast matrices. ```{r contr} contr(list(1, c(1, 2), c(1, 4)), n = 5) ``` A particular useful contrast is the following for considering all pairwise comparisons of different exposure estimates: ```{r pairwise.diff} pairwise.diff(3) estimate(gg, pairwise.diff(3), null=c(1,1,1), use=c(2,4,6)) ``` When conducting multiple tests each at a nominal-level of \(\alpha\) the overall type I error is not controlled at \(\alpha\)-level. The influence function also allows for adjusting for multiple comparisons. Let \(Z_{1},\ldots,Z_{p}\) denote \(Z\)-statistics from \(p\) distinct two-sided hypothesis tests which we will assume is asymptotically distributed under the null hypothesis as a zero-mean Gaussian distribution with correlation matrix \(R.\) Let \( Z_{max} = \max_{i=1,\ldots,p} |Z_i|\) then the family-wise error rate (FWER) under the null can be approximated by \[ P(Z_{max} > z) = 1-\int_{-z}^{z} \cdots \int_{-z}^{z} \phi_{R}(x_{1},\ldots,x_{p}) \,dx_{1}\cdots\,dx_{p} \] where \(\phi_{R}\) is the multivariate normal density function with mean 0 and variance given by the correlation matrix \(R\). The adjusted \(p\)-values can then be calculated as \[P(Z_{max} > \Phi^{-1}(1-p/2))\] where \(\Phi\) is the standard Gaussian CDF. As described in in [@RitzPipper_multcomp] the joint distribution of \(Z_{1},\ldots,Z_{p}\) can be estimated from the IFs. This is implemented in the `p.correct` method ```{r pcorrect, eval=has_mets } gg0 <- estimate(gg, use="^a", regex=TRUE, null=rep(.8, 3)) p.correct(gg0) ``` While this always yields a more powerful test compared to Bonferroni adjustments, a more powerful closed-testing procedure [@marcus1976], can be generally obtained by considering all intersection hypotheses. ![*Figure: closed testing via Wald tests of all intersections hypotheses.*](figs/closedtesting.pdf) To reject the \(H_{1}\) at an correct \(\alpha\)-level we should test all intersection hypotheses involving \(H_{1}\) and check if there all are rejected at the \(\alpha\)-level. The adjusted \(p\)-values can here be obtained as the maximum \(p\)-value across all the composite hypothesis tests. Unfortunately, this only works for relatively few comparisons as the number of tests grows exponentially. ```{r closedtesting, eval=has_mets} closed.testing(gg0) ``` ## Averaging Some parameters of interest are expressed as averages over functions of the observed data and estimated parameters of a model. The asymptotic distribution can in some of these cases also be derived from the influence function. Let \(Z_{1},\ldots,Z_{n}\) be iid observations, \(Z_{1}\sim \Pz\) and let \(X_{i}\subset Z_{i}\). Assume that \(\widehat{\theta}\) is RAL estimator of \(\theta\in\Omega\subset \mathbb{R}^{p}\) \[\sqrt{n}(\widehat{\theta}-\theta) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi(Z_{i}; \Pz) + \op(1). \] Let \(f:\mathcal{X}\times\Omega\to\mathbb{R}\) be continuous differentiable in \(\theta\) \[\sqrt{n}\{f(X; \widehat{\theta})-f(X; \theta)\} = \frac{1}{\sqrt{n}}\nabla_{\theta}f(X;\theta)\sum_{i=1}^{n}\phi(Z_{i}; P) + \op(1). \] Let \(\Psi = \Pz f(X;\theta)\) and \(\widehat{\Psi} = P_{n} f(X;\widehat{\theta})\). \(\Pz\) and \(P_{n}\) are here everywhere the integrals wrt. \(X\). It is easily verified that \begin{align*} \widehat{\Psi}-\Psi &= (P_{n}-\Pz)(f(X; \theta)-\Psi) + P[f(X;\widehat{\theta})-f(X;\theta)] \\ &\quad + (P_{n}-\Pz)[f(X;\widehat{\theta})-f(X;\theta)] \end{align*} From Lemma 19.24 [@vaart_1998_asymp] it follows that for the last term \begin{align*} \sqrt{n}(P_{n}-\Pz)[f(X;\widehat{\theta})-f(X;\theta)] = \op(1) \end{align*} when \(f\) for example is Lipschitz and more generally when \(f(X;\theta)\) forms a \(\Pz\)-Donsker class. It therefore follows that \begin{align*} \sqrt{n}(\widehat{\Psi}-\Psi) &= \sqrt{n}P_{n}(f(X; \theta)-\Psi) + \frac{1}{\sqrt{n}}P\nabla_{\theta}f(X;\theta)\sum_{i=1}^{n}\phi(Z_{i}; \Pz) + \op(1) \\ &= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\{f(X;\theta)-\Psi\} + \frac{1}{\sqrt{n}}P\nabla_{\theta}f(X;\theta)\sum_{i=1}^{n}\phi(Z_{i}, \Pz) + \op(1) \end{align*} Hence the IF for \(\widehat{\Psi}\) becomes \[ IC(Z; \Pz) = f(X;\theta)-\Psi + [\Pz\nabla_{\theta}f(X;\theta)]\phi(Z). \] Turning back to the example we can estimate the logistic regression model \(\operatorname{logit}(E\{Y_1 | A,X_1,W\}) = \beta_0 + \beta_a A + \beta_{x_1} X_1 + \beta_w W\), and from this we want to estimate the target parameter \[ \theta(P) = \E_{P}[E(Y\mid A=1, X_{1}, W)]. \] To do this we need first to estimate the model and then define a function that gives the predicted probability \(\pr(Y=1\mid,A=a,X_{1},W)\) for any observed values of \(X_1,W\) but with the treatment variable \(A\) kept fixed at the value \(1\) ```{r estpred} g <- glm(y1 ~ a + x1 + w, data=dw, family=binomial) pr <- function(p, data, ...) with(data, expit(p[1] + p["a"] + p["x1"]*x1 + p["w"]*w)) pr(coef(g), dw) |> head() ``` The target parameter can now be estimated with the syntax ```{r average} id <- foldr(NROW(dw), 100, list=FALSE) ea <- estimate(g, pr, average=TRUE, id=id) ea IC(ea) |> head() ``` # SessionInfo ```{r sessionInfo} sessionInfo() ``` # Bibliography