--- title: "Fast and accurate parallel numerical derivatives in R" output: rmarkdown::html_vignette: toc: true toc_depth: 2 number_sections: true bibliography: "../inst/REFERENCES.bib" author: "Andreï V. Kostyrka, University of Luxembourg" date: "Created: 2024-10-01, last modified: 2024-10-01, compiled: `r Sys.Date()`" abstract: "We illustrate how to use the novel **pnd** package for **R** to obtain numerical derivatives, gradients, and Hessians and offer empirically backed advice for common situations where finite-difference approximations are unreliable." keywords: "numerical differentiation, approximation error, finite differences, parallel computing" vignette: > %\VignetteIndexEntry{Fast and accurate parallel numerical derivatives in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::knit_hooks$set(pngquant = knitr::hook_pngquant) knitr::opts_chunk$set( dev = "png", dev.args = list(type = if (Sys.info()["sysname"] == "Darwin") "quartz" else "cairo-png"), fig.width = 512 / 72, fig.height = 320 / 72, out.width="10cm", dpi = 72, fig.retina = 1, collapse = TRUE, comment = "#>" ) if (.Platform$OS.type == "unix") knitr::opts_chunk$set(pngquant = "--speed=1 --quality=50-60") ``` # Introduction As of 2025, most popular R numerical optimisers and functions for computing numerical derivative approximations continue to rely on serial evaluation of functions, and even in scenarios where parallelisation is feasible, users often lack a convenient parallel solution. The `pnd` package addresses this lacuna by providing functions that: 1. Use multiple CPU cores to reduce numerical derivative computation time; 2. Choose the optimal finite-difference step size for desired derivative and accuracy orders; 3. Reuse existing function values to minimise evaluations or increase approximation accuracy; 4. Report an estimated measure of numerical inaccuracy. The first feature alone can significantly expedite optimisation in applied research by harnessing the full power of multi-core machines. Some R packages offer finite-difference-based derivatives with either hard-coded or suboptimal step sizes (e.g. `optim()` in the core \texttt{stats} @rcore for gradient-based methods), potentially leading to misleading results with quasi-Newton optimisers and lacking error diagnostic tools. Technical research articles and book chapters on numerical methods provide optimality results and explicit step-size formulæ, but user guides and package vignettes often lack practical insights into the principles underlying the selection of tuning parameters. This vignette showcases the core functionality of the `pnd` package. Despite the proliferation of parallelism-supporting R packages (e.g. `psqn`, `DEoptim`, `hydroPSO`), few implementations address gradient-related numerical optimisation issues. The `optimParallel` package by @optimParallel offers an internally parallelised L-BFGS-B method for the `optim()` optimiser.^[The problem with the L-BFGS-B method in R is its reliance on finite function values; the BFGS method admits infinite function values during the linear improvement search, but its limited-memory counterpart with box constraints throws an error upon encountering \texttt{NaN}, \texttt{NA}, or $\pm$\texttt{Inf}. This limits the scope of the `optimParallel` solution to a narrow family of well-behaved functions or functions modified by the user to replace non-finite values with an arbitrary large number.] 
As of May 2023, it does not support algorithms other than L-BFGS-B, which is problematic given the abundance of quasi-Newton solvers (`nlm`, `nlminb`, `nloptr::nloptr`, `Rsolnp::solnp`, `BB::spg`, `maxLik::maxBHHH`), all of which would benefit from parallelised gradients or Hessians.

To the best of our knowledge, no R packages currently address the problem of step size in finite-difference derivative approximation comprehensively. The `numDeriv` package offers facilities for controlling the accuracy order, step size, and other tuning parameters, but does not output diagnostic information. Users must request the diagnostic print-out manually, which works for gradients but not Hessians.^[`numDeriv:::grad.default()` prints out the Richardson extrapolation steps for first derivatives, but not cross-derivatives, which rely on `numDeriv:::genD.default()` instead.] The `pnd` package tackles this problem by offering three explicit data-driven step-search routines with detailed diagnostic information.

`numDeriv` and `pnd` differ fundamentally in attaining higher-order accuracy. `numDeriv` allows users to set the initial step size and request a certain number of Richardson improvement iterations to mitigate the truncation error, which run sequentially. In contrast, `pnd` achieves the desired accuracy by creating the necessary argument-value grid and evaluating the function in parallel, benefitting both high-dimensional applications like Barzilai--Borwein optimisation and low-dimensional gradients and Hessians. The results are comparable in effective accuracy, with minor differences attributable to the initial step size choice and the subsequent extrapolation rules in `numDeriv`.

In summary, the `pnd` package provides fast and accurate numerical differentiation and diagnostic features for numerical gradients, Jacobians, Hessians, and higher-order derivatives. This vignette includes theory and practical examples for step size selection in numerical derivatives.

For the remainder of the paper, we shall use the following abbreviations and notation conventions:

* FP = floating-point, FD = forward differences, CD = central differences, BD = backward differences.
* $f^{(i)}$ denotes the $i$^th^ derivative of $f$ (order 5 and higher), where $i$ is a Roman numeral.
* $x^{(n)}$ denotes the $n$^th^ coordinate of $x$, where $n$ is an Arabic numeral.
* $\hat f(x)$ or $[f(x)]$ denotes the finite-precision machine version of $f$ evaluated at $x$. Brackets are better for denoting the compounded error in expressions with chained operations.
* $\varepsilon_{f}$ denotes the machine-rounding error between the true value and its FP representation: $\varepsilon_{f} := f - \hat f$.
* $\epsilon_{\mathrm{m}}$ denotes the machine epsilon, i.e.\ the smallest positive FP number $a$ such that $[1+a] \ne 1$. On modern 64-bit machines, it equals $2^{-52} \approx 2.22\cdot 10^{-16}$ if the IEEE double-precision FP standard is used.^[Other works (e.g. @goldberg1991what) give a different definition of the machine epsilon -- the relative error bound. In our notation, any real number is rounded to the closest FP number with relative error less than $\epsilon_{\mathrm{m}}/2$.]
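
As a quick base-R illustration of this definition (not part of the package):

```{r}
.Machine$double.eps == 2^-52           # the machine epsilon on IEEE double-precision hardware
1 + .Machine$double.eps     == 1       # FALSE: adding the full epsilon changes 1
1 + .Machine$double.eps / 2 == 1       # TRUE: adding half of it is lost in rounding
```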

This vignette demonstrates the capabilities of the `pnd` package, thus users should attach it before executing the code in subsequent sections:

```{r setup}
library(pnd)
```

# Derivative approximation via finite differences

## Derivatives from Taylor series \label{sec:basic}

The true (analytical) derivative of the scalar-valued function $f$ at the point $x$ is defined as the limit
\[
f'(x) := \lim_{h\to0} \frac{f(x + h) - f(x)}{h}.
\]
The usual caveat applies: $f$ must be differentiable at $x$, i.e.\ the limit must exist. In many applications where analytical derivatives of $f$ are cumbersome or impossible to compute but the value of $f'(x)$ is needed, one intuitive approach is to remove the limit in the definition of the derivative and consider the finite difference
\[
f_{\mathrm{FD}_1}' (x, h) := \frac{f(x + h) - f(x)}{h}.
\]
This is the first-order-accurate forward-difference approximation of $f'$. One could *hypothetically* choose a sequence of decreasing step sizes $h_i$ (e.g. $\{0.1, 0.01, 0.001, \ldots\}$) and observe that the sequence $f'_{\mathrm{FD}} (x, 0.1), f'_{\mathrm{FD}} (x, 0.01), f'_{\mathrm{FD}} (x, 0.001), \ldots$ approaches a certain number as $h\to 0$. In practice, however, computing this sequence on a computer leads to erratic behaviour for very small $h$. This instability arises from the discrepancy between the true $f'_{\mathrm{FD}}$ and its machine evaluation, $\hat f'_{\mathrm{FD}}$. The nature of this discrepancy is investigated in what follows.

The first-order-accurate backward difference and the second-order-accurate central difference are defined similarly:
\[
f'_{\mathrm{BD}_1} := \frac{f(x)-f(x-h)}{h}, \qquad f'_{\mathrm{CD}_2} := \frac{0.5 f(x+h) - 0.5f(x-h)}{h}.
\]
Finite-difference-based approximations of derivatives receive the subscripts ‘FD’, ‘BD’, or ‘CD’, and are labelled ‘first-order-accurate’ and ‘second-order-accurate’, denoted by a digit in the subscript, based on the approximation error arising from truncating the Taylor expansion of $f$ at $x$:
\[
f(x \pm h) = \sum_{i=0}^{\infty} \frac{1}{i!} \frac{\mathrm{d}^i f}{\mathrm{d} x^i} (x) (\pm h)^i = f(x) \pm \frac{h}{1!} f'(x) + \frac{h^2}{2!} f''(x) \pm \frac{h^3}{3!} f'''(x) + \ldots
\]
Rearranging the terms and dividing by $h$:
\[
f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{h}{2} f''(x) - \frac{h^2}{6} f'''(x) - \ldots
\]
The **accuracy order** is defined as the smallest exponent of $h$ in the truncated part of the Taylor series (since $h\to0$, the lowest power dominates the remainder):
\[
f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{h}{2} f''(x) - \frac{h^2}{6} f'''(x + \alpha h) = f'_{\mathrm{FD}_1} - \frac{f''(x)}{2} h + O(h^2),
\]
where $\alpha \in [0, 1]$. The **truncation error** is the difference between the true derivative and its approximation based on the first few terms of the Taylor expansion:
\[
\varepsilon_{\mathrm{t}}(h) := f'(x) - f'_{\mathrm{FD}_1}(x, h).
\]
In forward differences, the dominant term in the truncation error is $\frac{f''(x)}{2} h^1$ -- the first power of~$h$ multiplied by a constant, hence the term **first-order accuracy**; since $h\to0$, we ignore the Lagrange remainder $O(h^2)$.^[Some confusion may stem from the fact that the derivative order in the error term is 2, $f''(x)$; the accuracy order relates to the step size power, not the derivative order in the Taylor expansion.] We reserve the letter $a$ for the accuracy order -- requested by the user and achievable by carrying out multiple function evaluations -- and the letter $m$ for the derivative order.
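
The erratic behaviour described above is easy to reproduce. The following sketch (with the arbitrary choices $f = \sin$ and $x = 1$) shows the forward-difference estimate of $f'(1) = \cos 1$ first improving and then degrading as $h$ shrinks:

```{r}
f <- sin
x <- 1
h <- 10^-(1:15)
fd1 <- (f(x + h) - f(x)) / h   # first-order-accurate forward differences
print(cbind(h = h, estimate = fd1, error = fd1 - cos(x)))
```

The error shrinks roughly until $h \approx 10^{-8}$ and grows afterwards -- the trade-off quantified in the sections below.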

Central differences are second-order-accurate owing to the cancellation of even-power terms in the Taylor expansion:
\[
f(x+h) - f(x-h) = (f(x) - f(x)) + (h - (-h)) f'(x)+ (h^2/2 - h^2/2) f''(x) \\ + \frac{h^3}{6} f'''(x+\alpha_1 h) + \frac{h^3}{6} f'''(x - \alpha_2 h),
\]
where $\alpha_1, \alpha_2 \in [0, 1]$. To simplify this expression, we use the generalised intermediate value theorem (GIVT):
\[
\frac{h^3}{6} f'''(x+\alpha_1 h) + \frac{h^3}{6} f'''(x - \alpha_2 h) = (h^3/6 + h^3/6) f'''(x+\alpha h) = \frac{h^3}{3} f'''(x+\alpha h)
\]
for some $\alpha \in [-1, 1]$. It follows that
\[
f(x+h) - f(x-h) = 2h f'(x) + \frac{h^3}{3} f'''(x + \alpha h)
\]
and
\[
f'_{\mathrm{CD}_2}(x) = \frac{0.5(2hf'(x) + \frac{h^3}{3} \cdot f'''(x+\alpha h) )}{h} = f'(x) + \frac{h^2}{6} f'''(x + \alpha h).
\]
The respective truncation error is $f'(x) - f'_{\mathrm{CD}_2}(x) = -\frac{f'''(x + \alpha h)}{6} h^2 = O(h^2)$.

## Derivatives on arbitrary stencils

\textcolor{red}{!!! TODO: not checked}

Function evaluations may be costly. In certain large-scale applications, it may be necessary to obtain a numerical derivative $f'(x_0)$ when the prohibitively slow $f$ was evaluated on a supercomputer with arguments $x_1$, $x_2$, and $x_3$. Such cases call for an approach that reuses existing cached function values to approximate derivatives at arbitrary points. At the same time, having more than two function evaluations at distinct points grants an opportunity to cancel out more terms in the Taylor expansion of the first derivative for a smaller remainder term.

We adopt the following terminology, with some adjustments, from @fornberg1988generation. For a sequence $\{b_i\}_{i=1}^{n}$, let $x + b_1 h, x + b_2 h, \ldots, x + b_n h$ be the **evaluation grid** for a function $f$. The column vector $(b_1, \ldots, b_n)'$ with all distinct elements is called the **stencil**. In the `pnd` package, users may rely on the default integer-valued stencil centred at 0 or choose a custom stencil. For this stencil, there exist such **finite-difference coefficients** $(w_1, \ldots, w_n)'$ that, when used as weights in the sum $\sum_{i=1}^n w_i f(x + b_i h)$, approximate $\mathrm{d}^mf/\mathrm{d}x^m$ up to the order of accuracy $a$:
\begin{equation}\label{eq:wsum}
\frac{\mathrm{d}^m f}{\mathrm{d} x^m} (x) = h^{-m} \sum_{i=1}^n w_i f(x + b_i h) + O(h^{a})
\end{equation}
It is necessary that $n > m$, yielding a formal accuracy order $a \ge n - m$ (in general).^[We write `formal' to distinguish between the desired accuracy and the finite machine precision discussed in the next section.]

The intuition behind this derivative approximation via weighted sums is twofold.

1. **More evaluations, higher accuracy.** Derivatives indicate the instantaneous rate of change of $f$ in a small neighbourhood where the function resembles a line. It is possible to choose two points, $(x_1, f(x_1)), (x_2, f(x_2))$, close to the point of interest to construct a linear approximation of $f$. More generally, $n$~points can be used to construct an interpolating polynomial of degree~$(n-1)$ and to compute the derivative of the power series analytically at any point. The higher the polynomial degree, the better the approximation within the convergence radius, hence any departures of $f$ from linearity are captured by higher-order polynomial terms available with more points per evaluation grid.
2. **More evaluations, higher derivative order.** Given a stencil of length $n$ and the values of $f$ at the corresponding grid, one can construct Taylor expansions of $f$ at those points and find such coefficients for their linear combination that all derivatives of order 0 through $(n-2)$ cancel out and the sum of coefficients on the $(n-1)$\textsuperscript{th} derivative becomes one. This implies that stencils with more points allow the researcher to compute linear combinations of functions whose Taylor expansion terms cancel out in the positions before and after the derivative of interest.

For a stencil $(b_1, \ldots, b_{n})$ and derivative order $m$, the weights that zero out the sums of derivatives of orders other than $m$ and result in a unit coefficient on the $m$\textsuperscript{th} derivative must satisfy
\[
\begin{pmatrix} b_1^0 & \cdots & b_n^0 \\ \vdots & \ddots & \vdots \\ b_1^{n-1} & \cdots & b_n^{n-1} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ m! \\ \vdots \\ 0 \end{pmatrix} \iff B \vec w = \vec m,
\]
where the non-zero entry $m!$ of the vector $\vec m$ occurs in the $(m+1)$\textsuperscript{st} position. The solution is
\[
\vec w = B^{-1} \vec m,
\]
provided that $B$ is invertible -- implying that the elements of the stencil must be distinct. Another condition follows from the definition of $\vec m$: for the element $m!$ to be placed in the position $m+1$, the stencil length must be at least $m+1$ (one more than the derivative order). The optimal weights $\vec w$ do not depend on $h$ because the weighted sum in Eq.~\eqref{eq:wsum} is already normalised by the factor $h^{-m}$. The weights for stencils $(-2, -1, 1, 2)$ and $(-10, -5, 5, 10)$ are therefore identical.

In short, the idea is to find such coefficients $\{w_i\}_{i=1}^n$ that the derivatives of orders other than $m$ cancel out and the coefficient on the $m$^th^ derivative is equal to unity. Our implementation uses the @bjorck1970solution algorithm to avoid inverting the numerically ill-conditioned Vandermonde matrix $B$. Using numerical routines relying on matrix inversion (such as LAPACK) invokable through `base::solve()` is not recommended due to the rapid deterioration of the solution accuracy. See Appendix A in @kostyrka2025what for an investigation of this numerical instability.

By default, `fdCoef()` assumes central differences for second-order-accurate first derivatives, but every aspect of numerical differencing can be customised (the calls from this list are collected in the chunk after it):

* `fdCoef()` without any arguments returns a list with stencil $(-1, 1)$ and weights $(-0.5, 0.5)$, translating to the approximation $f'_{\mathrm{CD_2}}(x, h) = (-0.5 f(x-h) + 0.5 f(x+h))/h$.
* Requesting higher-order accuracy via `fdCoef(acc.order = 4)` yields a larger stencil for the default derivative order ($m=1$) via central differences because more evaluations are needed to make $a-1$ terms cancel out.
* Requesting higher-order derivatives via `fdCoef(deriv.order = 3)` yields the minimally sufficient stencil for second-order accuracy ($a=2$) via central differences.
* Requesting forward differences yields a stencil starting at zero: `fdCoef(deriv.order = 2, side = "forward")` yields the minimally sufficient non-negative stencil for $m=2$, $a=2$.
* The user may request a custom stencil: `fdCoef(deriv.order = 3, stencil = c(-3, -1, 1, 3))` calculates the weights for a second-order-accurate $f'''_\Delta (x)$ obtained by evaluating $f$ at $x+(-3, -1, 1, 3)\cdot h$.
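
The following chunk runs the calls just described; the stencils and weights are printed as returned by `fdCoef()` (the exact shape of the output may vary across package versions):

```{r}
fdCoef()                                            # CD_2: stencil (-1, 1), weights (-1/2, 1/2)
fdCoef(acc.order = 4)                               # wider symmetric stencil, 4th-order accuracy
fdCoef(deriv.order = 3)                             # minimal central stencil for f''', a = 2
fdCoef(deriv.order = 2, side = "forward")           # one-sided stencil starting at zero
fdCoef(deriv.order = 3, stencil = c(-3, -1, 1, 3))  # custom stencil
```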

Central differences (or, more generally, stencils with both positive and negative points) yield *higher accuracy* for the same number of evaluations or require *fewer evaluations* to attain the desired accuracy order. Computing fourth-order-accurate second derivatives (e.g. for a very accurate Hessian diagonal) can be done on stencils $\{0, \pm 1, \pm 2\}$ (5 evaluations) or $\{0, 1, \ldots, 6\}$ (7 evaluations). The practical reason to prefer forward differences can be the extreme computation time of $f$; if $f(x)$ has been pre-computed, evaluating $(f(x+h) - f(x))/h$ is faster than $0.5 (f(x+h) - f(x-h))/h$. If $f(x)$ has not yet been computed, both $f_{\mathrm{FD}_1}$ and $f_{\mathrm{CD}_2}$ require 2 evaluations, so the latter is preferable for accuracy reasons. The output of `fdCoef()` for 5-point stencils matches formulæ 5.1--5.4 from @li2005general.

## Precision loss on computers

Numerical methods are often preferred over analytical ones. With highly non-linear models conquering applied research, providing analytical gradients becomes less feasible. Their derivation can be cumbersome and, due to the approximate nature of statistical methods, only marginally useful compared to their more affordable numerical approximations. Most numerical optimisation methods rely on gradients to determine the search direction, which is why the user is supposed to provide some form of gradient function. Without a gradient function, a gradient-based optimisation method might attempt to compute $f'_{\mathrm{CD}_2}$ with default parameters that can be suboptimal or even perilous.

After the optimiser converges (hopefully with zero exit code) to a certain optimal value of the argument (usually called the *estimate*), many implementations of statistical methods calculate a measure of uncertainty in the estimate due to the data randomness through the objective function curvature. This measurement often involves the Hessian matrix, which also needs to be computed or at least approximated. For example, if the parameter vector $\theta$ is estimated via maximum likelihood, assuming that the model specification is correct and the conditional density of the model error is fully known up to $\theta$, the asymptotic variance-covariance matrix of the estimator is often computed as the inverse of the observed Fisher information at the optimum -- the negative Hessian of the log-likelihood function. The accuracy of Hessian calculation depends on the step size, and the step size that minimises the numerical gradient error is unlikely to minimise the numerical Hessian error. Arguably, accurate Hessians are more important for applied researchers than accurate gradients because it is the inverse of the Hessian matrix that is used for inferential calculations. Matrix inversion, being a numerically unstable operation, propagates and magnifies any inaccuracies in the computed Hessian, often by several orders of magnitude.

The problem of derivative approximation boils down to limited precision in the default numerical routines written in most popular programming languages. Every number in computer memory is represented by a finite number of bits, which is why **arithmetic operations on computers are often lossy**. Except for special lossless operations like multiplying by powers of 2 (a mere exponent adjustment) or subtracting two numbers separated by less than one binary order of magnitude (Sterbenz lemma), most operations with most real numbers lead to **machine-rounding errors**.
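
A one-line demonstration of such a loss (plain base R, unrelated to the package):

```{r}
0.1 + 0.2 == 0.3      # FALSE: the sum is rounded to a nearby representable number
(0.1 + 0.2) - 0.3     # about 5.6e-17, the accumulated representation and rounding error
```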

Even converting decimal to binary is lossy: merely entering the number 0.3 into computer memory already introduces an error because the closest representable binary number is slightly less than 3/10. In our notation, $[3/10] \ne 3/10$.

To accurately compute numerical derivatives, researchers should track every potentially lossy operation in computer memory. This allows bounding the error, comparing the losses, and focusing on the preventable ones. The expression $\hat f'_{\mathrm{FD}_1}$ implies a 3-step procedure, namely $\left[\frac{[\hat f([x+h]) - \hat f(x)]}{h}\right]$. Assuming that $x$ is a value already loaded in the memory:

1. $x+h$ is computed by the machine (to the best available precision, usually 64-bit FP);
1. $f$ is evaluated at $x$ and $[x+h]$ (to the same precision);
1. The outermost arithmetic operations are performed: subtraction and division by $h$ (to the same precision).

Thus, 3 types of errors appear:
\[
\left[\frac{\hat f([x+h]) - \hat f(x)}{h}\right] = \frac{\bigl( f(x + h + \varepsilon_+) + \varepsilon_{f,1} - f(x) - \varepsilon_{f, 2}\bigr) + \varepsilon_-}{h} + \varepsilon_{/}.
\]

1. $\varepsilon_+$ is the **argument-rounding** error;
1. $\varepsilon_{f,1}$ and $\varepsilon_{f,2}$ are the **function-rounding** errors;
1. $\varepsilon_-$ and $\varepsilon_{/}$ are the **arithmetic-rounding** errors.

The arithmetic-rounding errors may lead to **catastrophic cancellation**: if two real numbers, $x_1$ and $x_2$, are separated by a very small difference, $|x_1 - x_2| \approx 0$, the machine representation of their difference, $[\hat x_1 - \hat x_2]$, has a potentially unbounded relative error. Example: `1.000000000000001 - 0.999999999999999` evaluates to `2.109424e-15` instead of `2e-15`, making the relative error higher than 5%.

It is common to ignore the argument-rounding errors because they are usually small compared to the function-rounding error. In the subsequent analysis, argument-rounding errors are ignored because it is always possible to choose such a step size $h$ that $[x+h] = x+h$ without any accuracy loss (usually achieved by considering $h = 2^{-i}$, $i \in \mathbb{Z}$, making $x+h$ a lossless bit-flip operation). For a rigorous analysis of the rounding error due to the binary representation error, see Table 3.1 in @mathur2012analytical; the recommendation of choosing $h$ as a power of 2 can be found *ibidem*.

It is also common to ignore the arithmetic-rounding errors in the numerator. Therefore, it suffices to bound the numerator error, calculate its amplification due to division by $h^m$, and minimise the result with respect to~$h$. The terms $f(x+h)$ and $f(x-h)$ usually have the same binary exponent; thus, by Theorem 11 in @goldberg1991what, $[f(x+h) - f(x-h)]$ is exact. Even for higher-order derivatives, the same order of magnitude of $f(x+h), f(x+2h),\ldots$ implies that in most cases, no radix shifting is required for IEEE~754-compliant addition, and the round-off error is due only to the normalisation of the sum. This normalisation error has a high probability of being zero, depending on the least significant bits. Similarly, multiplication by integer powers of 2 is achieved through incrementing or decrementing the (usually 11-bit) exponent without changing the significand, unless the number is too large or too small and some non-zero bits of the mantissa are lost due to shifting.
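
Both effects are easy to verify interactively (an illustration in plain R):

```{r}
print(0.3, digits = 20)                 # the stored double is slightly below 3/10
1.000000000000001 - 0.999999999999999   # catastrophic cancellation: not 2e-15
```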

Apart from the Taylor-series-related **truncation error**, the next source of inaccuracy is the **rounding error** $\varepsilon_{\mathrm{r}}$ -- the discrepancy between the theoretical finite-difference value and its machine evaluation, equal to the sum of the argument-, function-, and arithmetic-rounding errors. The **total error**, $\varepsilon := \varepsilon_{\mathrm{t}} + \varepsilon_{\mathrm{r}}$, is the sum of the ‘theoretical’ truncation and ‘practical’ rounding error. Each error can be bounded in absolute terms: $|\varepsilon_{\mathrm{t}}| < \overline{\varepsilon_{\mathrm{t}}}$, $|\varepsilon_{\mathrm{r}}| < \overline{\varepsilon_{\mathrm{r}}}$. Then, define the total-error function $\varepsilon(h) := \overline{\varepsilon_{\mathrm{t}}}(h) + \overline{\varepsilon_{\mathrm{r}}}(h)$ as the objective discrepancy measure to be minimised; the **optimal step size**, $h^* := \arg\min_h \varepsilon(h)$, minimises the total error. For the sake of brevity, the dependence of the $\varepsilon$ function on $f$, $x$, and machine precision is omitted from the notation.

Frustratingly, there is no universal ‘one-size-fits-all’ numerical difference step size $h$.

* If $h$ is too small, the error is mainly due to precision loss in the calculation of $\hat f([x+h])$ and the amplification of the sum of the numerator errors by the factor $1/h^m$;
* If $h$ is too large, the approximation $f'(x) \approx \frac{f(x+h) - f(x)}{h}$ becomes poor due to the omitted Taylor series terms. Higher non-linearity of $f(x)$ increases the error from higher-order terms evaluated at the point $x+h$ far from $x$.
* Derivatives of different orders depend differently on the step size: division by $h$ in $f'_{\mathrm{CD}_2}$ and by $h^2$ in $f''_{\mathrm{CD}_2}$ implies that the same value $h_0$ leads to potentially larger rounding errors in the evaluation of $f''$, even if both $f'_{\mathrm{CD}_2}$ and $f''_{\mathrm{CD}_2}$ have the same order of accuracy, $O(h^2)$.

Thus, the term ‘accuracy order’ might be misleading to statistical software users since *it is impossible to request arbitrary numerical derivative accuracy from a computer*. Instead, one can only bound the total loss. The goal is to find $h^*$ for a given application and obtain the *best* approximation of $\frac{\mathrm{d}^m f}{\mathrm{d} x^m}(x)$ on a given machine for a given $x$. The word ‘best’ is used in the sense that although the total rounding error might, by pure coincidence, be equal to zero, $h^*$ is ‘optimal’ if it ensures that the total-error function is minimised for most inputs in the small neighbourhood of $x$.

# Numerical derivatives of scalar functions

The derivations below follow the logic of Section 5.1 from @sauer2017numerical, without assuming that the order of magnitude of $f(x)$ is approximately 1. Differences are written as weighted sums of $f$ evaluated on a stencil sorted in ascending order in the spirit of @fornberg1988generation, reflecting the flexibility to use arbitrary stencils, useful if the function evaluation is expensive and existing values must be reused. This form also contains all multipliers affecting the rounding error in the numerator of the fraction, simplifying calculations.

## Two-sided derivatives

Consider the central-difference approximation of the first derivative on the default uniformly spaced stencil:
\[
f'(x) = \frac{-0.5f(x-h) + 0.5f(x+h)}{h} + O(h^2)
\]
The truncation error for this expression is
\[
f'(x) - f'_{\mathrm{CD}_2}(x) = - \frac{h^2}{6} f'''(x + \alpha h),
\]
where $\alpha \in [-1, 1]$.

This means that apart from the freely chosen $h$, the approximation error depends on the 3^rd^ derivative of $f$, which is usually unknown.

In computer memory, $f(x+h)$ is replaced with its FP version $\hat f(x+h) = f(x+h) + \varepsilon_f$, where $\varepsilon_f$ is a number on the order of the machine epsilon in relative terms:
\[
\frac{|\hat f(x) - f(x)|}{|f(x)|} \le \frac{\epsilon_{\mathrm{m}}}{2} \quad \Rightarrow \quad |\hat f(x) - f(x)| = |\varepsilon_f| \le 0.5|f(x)|\epsilon_{\mathrm{m}}
\]
Each function-rounding error is thus bounded:
\[
|\hat f(x\pm h) - f(x\pm h)| \le 0.5|f(x\pm h)|\epsilon_{\mathrm{m}} \approx 0.5|f(x)|\epsilon_{\mathrm{m}}
\]
Expanding the numerator:
\[
-0.5 \hat f(x-h) + 0.5\hat f(x+h) = -0.5 (f(x-h) + \varepsilon_{f,1}) + 0.5(f(x+h) + \varepsilon_{f,2}) \\ = 0.5(-f(x-h) + f(x+h)) + 0.5(\varepsilon_{f,2} - \varepsilon_{f,1})
\]
The function-rounding error is bounded due to the triangle inequality:
\[
0.5(\varepsilon_{f,2} - \varepsilon_{f,1}) \le 0.5(|\varepsilon_{f,2}| + |\varepsilon_{f,1}|) \le 0.5(2\cdot 0.5|f(x)|\epsilon_{\mathrm{m}}) = 0.5|f(x)|\epsilon_{\mathrm{m}}
\]
The total error decomposition into truncation and rounding errors follows:
\[
f'(x) - \hat f'_{\mathrm{CD}_2}(x) = f'(x) - \frac{0.5([-\hat f(x-h)] + [\hat f(x+h)])}{h} \\ = \underbrace{f'(x) - \frac{0.5(-f(x-h) + f(x+h))}{h}}_{\varepsilon_{\mathrm{t}}} + \underbrace{\frac{0.5(\varepsilon_{f_2} - \varepsilon_{f_1})}{h}}_{\varepsilon_{\mathrm{r}}} \\ = -\frac{h^2}{6}f'''(x + \alpha h) + \frac{0.5(\varepsilon_{f_2} - \varepsilon_{f_1})}{h} \le \frac{h^2}{6}|f'''(x + \alpha h)| + \frac{0.5|f(x)|\epsilon_{\mathrm{m}}}{h}
\]
We use an extra approximation $f'''(x + \alpha h) \approx f'''(x)$ because this term is not the largest source of error in this expression. The absolute error of the machine approximation of $f'(x)$ becomes bounded by the conservative approximation of the total-error function

\textcolor{red}{!!! Add subscript $e_1$, like in $e_3$ and $e_4$ henceforth}

\[
\varepsilon(h) := \frac{|f'''(x)|}{6}h^2 + 0.5|f(x)|\epsilon_{\mathrm{m}} h^{-1}.
\]
Minimising $\varepsilon(h)$ with respect to $h$ requires the first-order condition
\[
\varepsilon'(h) = \frac{|f'''(x)|}{3}h - 0.5|f(x)|\epsilon_{\mathrm{m}} h^{-2} = 0,
\]
which can be solved for $h > 0$:
\[
\frac{|f'''(x)|}{3}h = 0.5|f(x)|\epsilon_{\mathrm{m}} h^{-2} \quad \Rightarrow \quad h^*_{\mathrm{CD}_2} = \sqrt[3]{\frac{1.5 |f(x)| \epsilon_{\mathrm{m}}}{|f'''(x)|}}
\]
This expression has two problems:

1. Recursivity: we approximate the unknown first derivative with a finite difference, and the optimal step size involves the unknown third derivative. One may assume some order of magnitude for it, e.g.\ $|f'''(x)| = 1$, or use a rough numerical plug-in ‘guesstimate’ of $f'''(x)$ calculated on the smallest sufficient stencil (obtained via `fdCoef(3)`) and any reasonable *ad hoc* step size. $10^{-4}$ seems an acceptable plug-in value of $h$ for $f'''_{\mathrm{CD}_2}$; @dumontet1977determination describe a better approach, which we summarise in a dedicated section below. If computing a rough version of $f'''(x)$ is too costly, assume $|f(x) / f'''(x)| \approx 2/3$, which yields the simplest rule of thumb $h^* \approx \sqrt[3]{\epsilon_{\mathrm{m}}} \approx 6\cdot 10^{-6}$. If $f$ is suspected to have a high curvature, decrease $h$, and if $f$ changes at a steady rate, increase it.
2. If $f'''(x) \approx 0$, then, the optimal step size calculation involves division by near-zero numbers.
Quadratic objective functions would be one such example: $f'''(x) = 0\ \forall x \in \mathbb{R}$. In this case, $f'(x)$ is a linear function that is defined by two points; therefore, any larger step size can be used to determine the coefficients of this line equation, yielding a near-zero truncation error. If $f(x)$ is not linear and has $f'''(x) \approx 0$, but $|f''''(x)|$ and higher-order derivatives are not zero, data-driven procedures may be needed to find the optimal step size, although $O(h^3)$ might be sufficiently close to zero to ignore it completely.

The decrease in accuracy due to incorrect assumptions about $|f(x) / f'''(x)|$ depends on the ratio. If $|f'''(x)| \approx |f(x)| \approx 1$ at some point $x$, the total-error function is more sensitive to larger $h$ in relative terms, but to smaller $h$ in absolute terms in the neighbourhood of $h^*$, which can be seen in the logarithmic and linear plots:

```{r}
hseq <- 10^seq(-9, -3, length.out = 101)
hopt <- (1.5 * .Machine$double.eps)^(1/3)
e <- function(h) h^2/6 + 0.5*.Machine$double.eps / h
plot(hseq, e(hseq), log = "xy", xlab = "Step size", ylab = "Total error", asp = 1, bty = "n",
     type = "l", main = "Inaccuracy of CD-based derivatives (logarithmic)")
abline(v = hopt, lty = 2)
hseq2 <- seq(hopt - 5e-6, hopt + 5e-6, length.out = 101)
plot(hseq2, e(hseq2), xlab = "Step size", ylab = "Total error", bty = "n",
     type = "l", main = "Inaccuracy of CD-based derivatives (linear)")
abline(v = hopt, lty = 2)
```

With this optimal step size, the total-error function attains its minimum value of order $O(\epsilon_{\mathrm{m}}^{2/3})$, translating to 10--11 accurate decimal significant digits.

\textcolor{red}{!!! Give values for the error at the optimum in all subsequent sections}

Other bounds for the numerator round-off error exist in the literature. @sauer2017numerical assumes $|f(x)| \approx 1$ and $|\varepsilon_{f,1}| \approx |\varepsilon_{f,2}| \approx \epsilon_{\mathrm{m}}$. The latter assumption is too conservative compared to his earlier statement from Chapter 0 that the relative rounding error is at most $\epsilon_{\mathrm{m}} / 2$. Even if the round-off error is accumulated through $[x+h] \ne x+h$ and $|f'(x)| > 1$, this conservatism is stronger than the conservatism in our derivation, and the worst case is much less likely than the typical case. The absolute difference $|\varepsilon_{f_2} - \varepsilon_{f_1}|$, bounded above by $\epsilon_{\mathrm{m}} |f(x)|$, takes smaller values in most cases. If the relative rounding errors are independently uniformly distributed over $[-\epsilon_{\mathrm{m}}/2, \epsilon_{\mathrm{m}}/2]$, then $\varepsilon_{f_2} - \varepsilon_{f_1}$ has a *triangular* (not uniform) distribution on $[-\epsilon_{\mathrm{m}}, \epsilon_{\mathrm{m}}]$ with density $\frac{1}{\epsilon_{\mathrm{m}}}\left(1 - \frac{|t|}{\epsilon_{\mathrm{m}}}\right) \mathbb{I}(|t| \le \epsilon_{\mathrm{m}})$. If the total rounding error were uniformly distributed on $[-\epsilon_{\mathrm{m}}, \epsilon_{\mathrm{m}}]$, its variance and mean absolute deviation would be 2 and 1.5 times higher, respectively.

@gill1981practical remark in Section 8.5.1.1 that even if the function-rounding error is theoretically bounded by some number $\varepsilon^*$, this number may not be unique: multiple upper bounds may exist. The bound should represent a good estimate of the rounding error at all points in the neighbourhood of the evaluation point, which is usually achievable through rounding-error analysis.
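
A quick numerical check of the ‘10--11 accurate digits’ claim above (a sketch with the arbitrary choices $f = \exp$ and $x = 1$, for which $f''' = f$, so the step-size formula applies with $|f(x)/f'''(x)| = 1$):

```{r}
x <- 1
h <- (1.5 * .Machine$double.eps)^(1/3)   # optimal CD_2 step when |f/f'''| = 1
cd2 <- (exp(x + h) - exp(x - h)) / (2 * h)
abs(cd2 - exp(x)) / exp(x)               # relative error on the order of 1e-11
```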

For many applied researchers, the goal is minimising the total error comprising the *average* (not maximum) absolute rounding error, as in @dumontet1977determination. In this case, the rounding-error component must be divided by 1.5:
\[
\varepsilon_{\mathrm{r}} \le \frac{|f(x)|\epsilon_{\mathrm{m}}}{3h} \quad \Rightarrow \quad h^*_{\mathrm{CD}_2} = \sqrt[3]{\frac{|f(x)| \epsilon_{\mathrm{m}}}{|f'''(x)|}}.
\]
This translates to the step size being 1.145 times shorter than the one relying on the conservative rounding-error bound. Intuitively, a smaller rounding-error component allows the user to gain accuracy by reducing the truncation-related part of the error, although in most applications, this factor may be safely ignored.

The opposite occurs if $f(x+h)$ and $f(x-h)$ come from a numerical routine with limited accuracy. In such cases, the relative error of each evaluation is much higher, and the bound on the difference $|\varepsilon_{f,2} - \varepsilon_{f,1}|$ must be adjusted accordingly. See Section~\ref{sec:noisy} for examples of such adjustments.

## One-sided derivatives

\textcolor{red}{!!! TODO}

We apply a similar argument to compute one-sided numerical derivatives of costly functions when the value $f(x)$ is already known and only $f(x+h)$ remains to be computed.

\textcolor{red}{!!! Compute the remainder term for non-symmetric stencils (should be larger for $f'$ on [0, 1, 2] than on [-1, 0, 1]) and advocate symmetric stencils}

Mathematicians noticed the superior accuracy of approximations of functions on grids symmetric around the value of interest more than a century ago. Symmetric stencils for evaluations of derivatives were explicitly proposed in @aitken1938application, following his 1932 paper on interpolation by iteration based on an earlier 1928 result by C. Jordan that recast the 1901 Everett interpolation formula into a format that was more parsimonious in terms of computations. The computational advantages and higher accuracy of symmetric stencils on finite-precision computers were restated by @oliver1975selection, who advocate the use of symmetric stencils ‘wherever possible’.

Although inferior in terms of accuracy, one-sided derivatives might be the only feasible solution where one evaluation of $f(x)$ takes dozens of seconds or where the gradient-based optimisation routine takes too many steps to converge. Except for the situations where the curvature of a computationally expensive function must be inferred from what the researcher can glean from existing cached results, there are no other practical or theoretical reasons to use one-sided derivatives with more than two grid points instead of two-sided ones.

<...>

The fact that FD approximations have such a high total error was noted by early researchers (@gill1981practical, Section 8.2.2) and can be summarised as follows: **at least half of all digits of a FD numerical derivative are wrong**. The ‘at least’ part is due to the function scaling and step size; higher accuracy cannot simply be achieved. By contrast, **at least 1/3 of all digits of a CD numerical derivative are wrong**, which leaves more room for error in applications that optimise with a gradient-based stopping rule. If the algorithm terminates when the Euclidean norm of the gradient is less than some tolerance (usually $\approx \sqrt{\epsilon_{\mathrm{m}}}$), then the FD derivative with an error magnitude comparable to the tolerance should be rejected in favour of the more accurate CD derivative.

This does not apply if the maximisation is preliminary and a more accurate optimiser at a later stage refines the argument value for a more rigorous solution.

## Second derivatives

The formula~\eqref{eq:wsum} yields the coefficients for derivatives of any order; the following invocation of `fdCoef()` yields the optimal coefficients for the standard symmetric stencil.

```{r}
fdCoef(deriv.order = 2, acc.order = 2)  # Same as fdCoef(2)
```

To find the best approximation on this stencil, we use a similar expansion to quantify the truncation error:
\[
f''(x) = \frac{f(x-h) - 2f(x) + f(x+h)}{h^2} + O(h^2)
\]
\[
f(x\pm h) = f(x) \pm \frac{h}{1!}\cdot f'(x) + \frac{h^2}{2!} f''(x) \pm \frac{h^3}{3!} f'''(x) + \frac{h^4}{4!} f''''(x \pm \alpha_4^{\pm} h),
\]
where $\alpha_4^{+}, \alpha_4^{-} \in [0, 1]$. The approximations differ from the true derivatives by the following terms:
\[
f''_{\mathrm{CD}}(x) = \frac{f(x-h) - 2 f(x) + f(x+h)}{h^2} \\ = \frac{h^2 f''(x) + \frac{h^4}{24} (f''''(x+\alpha_{4}^+ h) + f''''(x - \alpha_{4}^- h)) }{h^2} = f''(x) + \frac{h^2}{12}f''''(x + \alpha_4 h),
\]
where $\alpha_4 \in [-1, 1]$ by the GIVT. Since $h\to 0$, we use the same approximate bound for steps of length $2h$ (needed for the wider stencils below):
\[
|[f(x\pm 2h)] - f(x\pm 2h)| \le 0.5|f(x \pm 2h)|\epsilon_{\mathrm{m}} \approx 0.5|f(x)|\epsilon_{\mathrm{m}}
\]
The rounding error in the numerator is bounded as follows:
\[
[f(x-h)] - 2[f(x)] + [f(x+h)] = f(x-h) + \varepsilon_1 - 2(f(x) + \varepsilon_2) + f(x+h) + \varepsilon_3 \\ \le f(x-h) - 2f(x) + f(x+h) + 4\cdot 0.5|f(x)| \cdot \epsilon_{\mathrm{m}} \\ = f(x-h) - 2f(x) + f(x+h) + 2 |f(x)| \epsilon_{\mathrm{m}}
\]
The overall error is
\[
f''(x) - \hat f''_\mathrm{CD}(x) = f''(x) - \frac{[f(x-h)] - 2[f(x)] + [f(x+h)]}{h^2} \\ \le f''(x) - \frac{f(x-h) - 2f(x) + f(x+h)}{h^2} + \frac{2 |f(x)| \epsilon_{\mathrm{m}} }{h^2} \\ \approx \frac{h^2}{12}f''''(x + \alpha_4 h) + \frac{2 |f(x)| \epsilon_{\mathrm{m}} }{h^2}
\]
Since $f''''(x + \alpha_4 h) \approx f''''(x)$, the absolute error of $f''_{\mathrm{CD}_2}(x)$ is bounded by
\[
\varepsilon(h) := \frac{|f''''(x)|}{12}h^2 + 2|f(x)|\epsilon_{\mathrm{m}} h^{-2}
\]
Minimising $\varepsilon(h)$ with respect to $h$ yields
\[
\varepsilon'(h) = \frac{|f''''(x)|}{6}h - 4|f(x)|\epsilon_{\mathrm{m}} h^{-3} = 0,
\]
\[
\frac{|f''''(x)|}{6}h = 4|f(x)|\epsilon_{\mathrm{m}} \frac{1}{h^3} \quad \Rightarrow \quad h = \sqrt[4]{\frac{24 |f(x)| \epsilon_{\mathrm{m}}}{|f''''(x)|}}
\]
Without additional information about the function, one can assume $|f(x) / f''''(x)| \approx 1$ and take $h^* = \sqrt[4]{24 \epsilon_{\mathrm{m}}} \approx 0.00027$. If higher-order derivatives of $f$ are close to zero, then, there is no penalty in taking larger steps.

## Higher derivatives

The analysis for third and fourth differences contains a new obstacle because it involves steps in multiples of $h$.

```{r}
w3 <- fdCoef(3)
w4 <- fdCoef(4)
print(w3)
print(w4)
```

\[
f'''(x) = \frac{-0.5f(x-2h) + f(x-h) - f(x+h) + 0.5f(x+2h)}{h^3} + O(h^2)
\]
\[
f''''(x) = \frac{f(x-2h) - 4f(x-h) + 6f(x) - 4f(x+h) + f(x+2h)}{h^4} + O(h^2)
\]
The Taylor expansions for step-size multipliers $c \in \{1, 2\}$ are
\[
f(x\pm c\cdot h) = f(x) \pm \frac{ch}{1!}\cdot f'(x) + \frac{(ch)^2}{2!} f''(x) \pm \frac{(ch)^3}{3!} f'''(x) + \frac{(ch)^4}{4!} f''''(x) \\ \pm\begin{cases} \frac{(ch)^5}{5!} f^{(V)}(x \pm \alpha ch), & m=3\\ \frac{(ch)^5}{5!} f^{(V)}(x) + \frac{(ch)^6}{6!} f^{(VI)}(x \pm \alpha ch), & m=4, \end{cases}
\]
where $\alpha \in [0, 1]$. The value of the multiplier $\alpha$ in the Lagrange remainder is different for each step $\pm h$, $\pm 2h$. For $m=3$, the expression becomes
\[
f'''_{\mathrm{CD}}(x) = \frac{0.5(-f(x-2h) + 2f(x-h) - 2 f(x+h) + f(x+2h))}{h^3} \\ = \frac{h^3 f'''(x) + \frac{h^5}{240} (32 f^{(V)}(x-2\alpha_{51} h) - 2f^{(V)}(x-\alpha_{52} h) - 2f^{(V)}(x+\alpha_{53} h) + 32 f^{(V)}(x+2\alpha_{54} h))}{h^3},
\]
where $\alpha_{51}, \alpha_{52}, \alpha_{53}, \alpha_{54} \in [0, 1]$. Here, the GIVT cannot be applied because not all coefficients on $f^{(V)}$ are positive. They can still be grouped:
\[
32 f^{(V)}(x-2\alpha_{51} h) - 2f^{(V)}(x-\alpha_{52} h) - 2f^{(V)}(x+\alpha_{53} h) + 32 f^{(V)}(x+2\alpha_{54} h) \\ = 64 f^{(V)} (x + 2\alpha_{5p}h) - 4 f^{(V)} (x + \alpha_{5m}h),
\]
where $\alpha_{5p}, \alpha_{5m} \in [-1, 1]$. To place an upper bound on this term, we use the triangle inequality for the terms' absolute values:
\[
64 f^{(V)} (x + 2\alpha_{5p}h) - 4 f^{(V)} (x + \alpha_{5m}h) \le 64| f^{(V)} (x + 2\alpha_{5p}h)| + 4|f^{(V)} (x + \alpha_{5m}h)| \\ \le 68 \max\{| f^{(V)} (x + 2\alpha_{5p}h)|, |f^{(V)} (x + \alpha_{5m}h)|\}
\]
Again, we rely on $h\to0$ to obtain $f^{(V)} (x + 2\alpha_{5p}h) \approx f^{(V)} (x + \alpha_{5m}h) \approx f^{(V)}(x)$.

Instead of repeating the same chain of calculations with obfuscated notation, we show how to obtain the coefficients on the non-vanishing higher-order terms for $m=4$ algorithmically:

```{r}
denom <- factorial(0:6)
names(denom) <- paste0("f'", 0:6)
num.0 <- c(1, rep(0, 6))  # f(x) = f(x) + 0*f'(x) + 0*f''(x) + ...
num.h <- rep(1, 7)
num.2h <- 2^(0:6)
# f(x-ch) = f - ch f' + (ch)^2/2 f'' - (ch)^3/6 f''' ...
num.mh <- suppressWarnings(num.h * c(1, -1))  # Relying on recycling
num.m2h <- suppressWarnings(num.2h * c(1, -1))
num <- colSums(rbind(num.m2h, num.mh, num.0, num.h, num.2h) * w4$weights)
print(round(num / denom, 5))
```

As expected, there is a unit coefficient on the 4^th^-order term. However, the last term shown above is not the correct one because the non-applicability of the GIVT requires a higher upper bound through the use of absolute values:

```{r}
sum(abs(w4$weights[c(1, 2, 4, 5)] * c(num.m2h[7], num.mh[7], num.h[7], num.2h[7]))) / denom[7]
```

This yields a coefficient on $f^{(VI)}(x)$ equal to $136/720 = 17/90$, not $120/720 = 1/6$. As for the rounding errors in the numerator, they are bounded by the sum of absolute relative errors.

For $m=3$,
\[
[-0.5f(x-2h)] + [f(x-h)] - [f(x+h)] + 0.5[f(x+2h)] = \\ -0.5f(x-2h) - 0.5\varepsilon_1 + f(x-h) + \varepsilon_2 - f(x+h) - \varepsilon_3 + 0.5f(x+2h) + 0.5\varepsilon_4 \\ = \ldots - 0.5\varepsilon_1 + \varepsilon_2 - \varepsilon_3 + 0.5\varepsilon_4 \\ \le \ldots + 0.5 |f(x)| \epsilon_{\mathrm{m}} (0.5 + 1 + 1 + 0.5) = \ldots + 1.5 |f(x)| \epsilon_{\mathrm{m}}
\]
For $m=4$, we compute the multiplier on $0.5 |f(x)| \epsilon_{\mathrm{m}}$ in one line of code:

```{r}
sum(abs(w4$weights))
```

The total errors are thus
\[
f'''(x) - [f'''_\mathrm{CD}(x)] \le \varepsilon_3(h) := \frac{17 |f^{(V)}(x)|}{60} h^2 + \frac{1.5 |f(x)| \epsilon_{\mathrm{m}} }{h^3}
\]
\[
f''''(x) - [f''''_\mathrm{CD}(x)] \le \varepsilon_4(h) := \frac{17 |f^{(VI)}(x)|}{90} h^2 + \frac{8 |f(x)| \epsilon_{\mathrm{m}}}{h^4}
\]
Finding the minimum of the total error functions:
\[
\frac{17|f^{(V)}(x)|}{30}h = 4.5|f(x)|\epsilon_{\mathrm{m}}\frac{1}{h^4} \quad \Rightarrow \quad h^*_3 = \sqrt[5]{\frac{135 |f(x)| \epsilon_{\mathrm{m}}}{17|f^{(V)}(x)|}}
\]
\[
\frac{17|f^{(VI)}(x)|}{45}h = 32|f(x)|\epsilon_{\mathrm{m}}\frac{1}{h^5} \quad \Rightarrow \quad h^*_4 = \sqrt[6]{\frac{1440 |f(x)| \epsilon_{\mathrm{m}}}{17|f^{(VI)}(x)|}}
\]
Without any extra information about the function, one can assume $|f(x) / f^{(V)}(x)| \approx 1$, $|f(x) / f^{(VI)}(x)| \approx 1$, and take $h^*_3 = \sqrt[5]{\frac{135}{17} \epsilon_{\mathrm{m}}} \approx 0.0011$ and $h^*_4 = \sqrt[6]{\frac{1440}{17} \epsilon_{\mathrm{m}}} \approx 0.005$. In applied work, functions that are not approximated well by 4^th^- and 5^th^-order polynomials are not common, which is why it is possible that $f^{(V)}(x) \to 0$, $f^{(VI)}(x) \to 0$, and the main source of error is precision loss due to the division by $h^m$, not poor Taylor approximation.

## Fourth-order-accurate derivatives

If one requests `fdCoef(acc.order = 4)`, then, the previously obtained result for $h^*_{\mathrm{CD}_2}$ is no longer valid because the truncation error depends on a different power of $h$. We compute this coefficient exactly:

```{r}
w <- fdCoef(acc.order = 4)
h2 <- 2^(0:5)
h <- rep(1, 6)
hm <- h * c(1, -1)
h2m <- h2 * c(1, -1)
coef.tab <- rbind(h2m, hm, h, h2)  # Here, using rbind is more convenient
rownames(coef.tab) <- names(w$weights)
colnames(coef.tab) <- paste0("f'", seq_len(ncol(coef.tab)) - 1)
print(coef.tab * w$weights)
print(colSums(coef.tab * w$weights))
```

This calculation confirms that there is no $f'''(x)$ in the expression for the difference-based approximation. We apply the same technique due to the presence of different step sizes and the non-applicability of the GIVT:
\[
\frac{1}{12} f(x-2h) - \frac23 f(x-h) + \frac23 f(x+h) - \frac{1}{12} f(x+2h) = \\ h f'(x) + \frac{h^5}{120}\left(-\frac83f^{(V)}(x-2\alpha_1 h) + \frac23 f^{(V)}(x-\alpha_2 h) + \frac23 f^{(V)}(x+\alpha_3 h) - \frac83 f^{(V)}(x+2\alpha_4 h) \right)
\]
Some terms in the brackets can still be grouped:
\[
-\frac83 f^{(V)}(x-2\alpha_1 h) + \frac23 f^{(V)}(x-\alpha_2 h) + \frac23 f^{(V)}(x+\alpha_3 h) - \frac83 f^{(V)}(x+2\alpha_4 h) = -\frac{16}{3} f^{(V)}(c_1) + \frac43 f^{(V)}(c_2),
\]
where $x-2h \le c_1 \le x+2h$ and $x-h \le c_2 \le x+h$. An upper bound through approximation is available:

\textcolor{red}{!!! Move this expression into the previous section}

\[
|af^{(V)}(c_2) - bf^{(V)}(c_1)| \le a|f^{(V)}(c_2)| + b|f^{(V)}(c_1)|, \quad |f^{(V)}(c_1)| \approx |f^{(V)}(c_2)| \approx |f^{(V)}(x)|;
\]
therefore, the difference in the numerator is less than or equal to $\frac{\frac{20}3 |f^{(V)}(x)|}{120} h^5$, and the truncation error is bounded by the same divided by $h$:
\[
|f'(x) - f'_{\mathrm{CD}_4}(x)| \approx \frac{|f^{(V)}(x)|}{18} h^4.
\]
The rounding error is bounded by the weighted value of $|f(x)|\epsilon_{\mathrm{m}}$:
\[
\sum_{i} [w_i f(x_i)] - \sum_{i} w_i f(x_i) \le \sum_i |w_i f(x_i)| \cdot 0.5\epsilon_{\mathrm{m}},
\]
and since the terms in the numerator are approximately equal to $f(x)$, the absolute rounding error is bounded by $0.75 |f(x)| \epsilon_{\mathrm{m}}$:

```{r}
0.5*sum(abs(w$weights))
```

The total error minimisation problem is similar to the ones solved before:
\[
\varepsilon(h) = \frac{|f^{(V)}(x)|}{18} h^4 + \frac{0.75 |f(x)| \epsilon_{\mathrm{m}}}{h}
\]
\[
\varepsilon'(h) = 0 \quad \Rightarrow \quad \frac{2|f^{(V)}(x)|}{9} h^3 = \frac{0.75 |f(x)| \epsilon_{\mathrm{m}}}{h^2} \quad \Rightarrow \quad h^{*}_{\mathrm{CD}_4} = \sqrt[5]{\frac{27 |f(x)| \epsilon_{\mathrm{m}}}{8 |f^{(V)}(x)|}} \propto \sqrt[5]{\epsilon_{\mathrm{m}}}
\]
The reduction of the truncation error from the Taylor series allows for a larger step size for better rounding-error control. The total error is therefore of the order $O(\epsilon_{\mathrm{m}}^{4/5})$, which translates into 1--2 more accurate decimal digits compared to the second-order-accurate derivative: $\log_{10} (\epsilon_{\mathrm{m}}^{2/3} / \epsilon_{\mathrm{m}}^{4/5}) \approx 2$. Of course, this is true only if the guessed value of $|f^{(V)}|$ is reasonable.

## General derivative and accuracy orders

Although calculation of derivatives of order greater than 4 is of purely theoretical interest, we provide a rule of thumb for bandwidth choice and the algorithm to compute the coefficients on the error components.

The preceding derivations yield a pattern: every time a higher-order derivative is required, the power of $h$ in the denominator increases by 1, and the power of $h$ in the numerator remains at~2 because the central differences in question are second-order-accurate. For the $m$^th^ derivative of accuracy order $a=2$, the optimal step size $h^*_m$ is $c_m \sqrt[m+2]{\epsilon_{\mathrm{m}}}$. The rough approximation $c_m \approx 1$ yields $h^*_m \approx \sqrt[m+2]{\epsilon_{\mathrm{m}}}$.

A slightly finer version relying on analytical expressions is as follows. When accuracy of order 2 is requested, the Taylor-series truncation error is $O(h^2)$, and the multiplier on $h^2$ depends on the $(m+2)$^th^ derivative of $f$. For step sizes $0, \pm h, \pm 2h, \ldots$ and weights $w_{0}, w_{\pm 1}, w_{\pm 2}, \ldots$, the coefficients on the non-vanishing terms are computed as follows.

```{r}
m <- 7  # Example order
w <- fdCoef(m)
k <- sum(w$stencil > 0)  # ±h, ±2h, ..., ±kh
num.pos <- sapply(1:k, function(i) i^(0:(m+2)))
num.neg <- apply(num.pos, 2, function(x) x * c(1, -1))
num.neg <- num.neg[, rev(seq_len(ncol(num.neg)))]
nz <- abs(w$stencil) > 1e-12  # Non-zero function weights
coef.tab <- sweep(cbind(num.neg, num.pos), 2, w$weights[nz], "*")
rownames(coef.tab) <- paste0("f'", seq_len(nrow(coef.tab))-1)
colnames(coef.tab) <- names(w$weights[nz])
print(coef.tab)
```

This table shows the contribution of each derivative to the numerator of the approximation.

By construction, the row sums add up to zero because every initial term, except for the $m$^th^ derivative, must be zeroed out:

```{r}
rs <- rowSums(coef.tab)
print(round(rs, 4))
```

The coefficient on the $(m+2)$^th^ derivative is also non-zero, but, as we established before, it underestimates the true upper bound. Due to the non-applicability of the GIVT with different step sizes, we add up the absolute values of the expansion coefficients. Finally, this value must be divided by the denominator -- the factorial. This yields the coefficient on $h^2$:

```{r}
print(c1 <- sum(abs(coef.tab[nrow(coef.tab), ])) / factorial(m+2))
```

The second part, the rounding error, can be bounded from above by the sum of absolute values of the coefficients in the numerator. The coefficient on $|f(x)| \epsilon_{\mathrm{m}}$ is

```{r}
print(c2 <- 0.5*sum(abs(w$weights)))
```

The expression below is minimised w.r.t.\ $h$; solving the FOC is easy because both components are power functions:
\[
\varepsilon(h) = c_1 |f^{(m+2)}(x)| h^2 + c_2 |f(x)| \epsilon_{\mathrm{m}} \frac{1}{h^m}
\]
\[
\varepsilon'(h) = 0 \quad \Rightarrow \quad 2c_1 |f^{(m+2)}(x)| h = m c_2 |f(x)| \epsilon_{\mathrm{m}} \frac{1}{h^{m+1}} \quad \Rightarrow \quad h = \sqrt[m+2]{\frac{mc_2 |f(x)| \epsilon_{\mathrm{m}}}{2c_1 |f^{(m+2)}(x)|}}
\]
In general, the optimal step size for the desired accuracy order $a$ (even) and derivative order $m$ is proportional to $\sqrt[a+m]{\epsilon_{\mathrm{m}}}$.

\textcolor{red}{!!! Explicitly write the general formula}

The accuracy loss due to the guessed higher-order derivative value depends on the exact values of $h$ and $h^*$. As shown in Section 3.4 of @mathur2013algorithm, the slope of the total-error function is not equal for over- and under-estimated optimal step sizes: in logarithmic axes, for $h > h^*$, the slope is positive and is equal to $a$ (accuracy order), and for $h < h^*$, it is negative and equal to $-m$ (differentiation order). For higher-order-accurate first derivatives, the optimal step size must be larger than the optimal size for $a=2$, but not by much, lest the truncation-related part of the error should explode, which poses a problem that can be solved via data-driven methods. A viable solution is proposed in the same thesis and is described in Section \textcolor{red}{!!!}.

## Cross-derivatives

\textcolor{red}{!!! TODO}

* Move the part from the Hessian section here

# Gradients, Jacobians, and Hessians

## Numerical gradients

\textcolor{red}{!!! TODO}

* Gradients are repeated derivatives; choose the optimal size for each coordinate
* The theory remains the same
* Compute gradients faster by pre-defining a grid and doing parallel calls across all steps for all coordinates to maximise CPU use and reduce overhead and idle time

## Numerical Jacobians

\textcolor{red}{!!! TODO}

* Jacobians are repeated gradients; choose the optimal size for each matrix element
* The theory remains the same
* Same speed-up: pre-define a grid of all parameters (3D: all steps $\times$ all input argument coordinates $\times$ all output function coordinates), do parallel calls

## Numerical Hessians and cross-derivatives

The Hessian matrix for a function $f$ is denoted by $H_f$ and is defined as the square matrix of second-order partial derivatives of $f$:
\[
H_f (x) = \nabla^2 f(x) = \left\{ \frac{\partial^2}{\partial x_i \partial x_j} f(x)\right\}_{i,j=1}^{\dim x} = \begin{pmatrix} \frac{\partial^2 }{\partial x_1^2} & \cdots & \frac{\partial^2 }{\partial x_1\,\partial x_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 }{\partial x_p\,\partial x_1} & \cdots & \frac{\partial^2 }{\partial x_p^2} \end{pmatrix} f(x)
\]
The Hessian can be expressed more concisely as the transposed Jacobian of the gradient:
\[
H_f (x) = J^T (\nabla f(x)) = \begin{pmatrix} \partial / \partial x_1\ (\nabla f(x))^T \\ \vdots \\ \partial / \partial x_p\ (\nabla f(x))^T \end{pmatrix}
\]
The simplest approach to compute Hessian elements numerically is to calculate finite differences of finite differences w.r.t. two indices. For each $i=1,\ldots,p$, define $h_i := \begin{pmatrix}0 & \ldots & 0 & \underbrace{h}_{i^{\text{th}} \ \text{position}} & 0 & \ldots & 0 \end{pmatrix}'$ as the length-$p$ vector where the only non-zero element is in the $i$^th^ coordinate. Then, 4 evaluations of $f$ are required to compute $H_{ij}$ via central differences:
\[
(\nabla f(x))^{(i)} := \nabla_i f(x) \approx \frac{f(x + h_i) - f(x - h_i)}{2 h}
\]
\begin{equation}\label{eq:nhess}
H_{ij} \approx \tfrac{\nabla_i f(x + h_j) - \nabla_i f(x - h_j)}{2 h} \approx \tfrac{f(x + h_i + h_j) - f(x - h_i + h_j) - f(x + h_i - h_j) + f(x - h_i - h_j)}{4h^2}
\end{equation}
If $f(x)$ is computationally costly, using forward differences $\frac{f(x + h_i) - f(x)}{h}$ requires fewer evaluations and sacrifices accuracy for speed gains.

This weighted sum is similar to the expression for second derivatives ($f''_{\mathrm{CD}}(x)$), but the derivation of the truncation error in this case becomes notationally more cumbersome. By analogy, we expect the truncation error to be of the second order in the power of $h$ and to involve fourth-order cross-derivatives of $f$. Therefore, the optimal step size for Hessian computation may be safely assumed to be of the order $\sqrt[4]{\epsilon_{\mathrm{m}}}$. (Intuitively, a larger step size is required because the division in this case is by $h^2$.)

The main diagonal of the Hessian, $H_{ii}$, contains second derivatives of $f$, and the result above for functions of scalar arguments applies: $h^{**}_{\mathrm{CD}_2} \propto \sqrt[4]{\epsilon_{\mathrm{m}}}$. The $f''$ column contributes non-zero off-diagonal elements, and all the terms containing $f'''$ cancel out, which can be verified numerically:

\textcolor{red}{!!! Check for mistakes}

```{r}
getCoefs <- function(x, ord) do.call(expand.grid, replicate(ord, x, simplify = FALSE))
rowProd <- function(x) apply(x, 1, prod)
getMults <- function(ord) {
  steps <- list(c(1, 1), c(-1, 1), c(1, -1), c(-1, -1))
  coefs <- lapply(steps, getCoefs, ord = ord)
  signs <- lapply(coefs, rowProd)
  mults <- signs[[1]] - signs[[2]] - signs[[3]] + signs[[4]]
  ntab <- expand.grid(replicate(ord, c("i", "j"), simplify = FALSE))
  names(mults) <- apply(ntab, 1, paste0, collapse = "")
  return(mults)
}
print(getMults(2))
print(getMults(3))
```

The terms with $f$ and $f'$ disappear, and the 4-term sum becomes
\[
\frac{f(x + h_i + h_j) - f(x - h_i + h_j) - f(x + h_i - h_j) + f(x - h_i - h_j)}{4h^2} = \frac{\frac{1}{2}h^2 (4H_{ij} + 4H_{ji}) + O(h^4)}{4h^2},
\]
therefore,
\[
H_{ij} - \hat H_{ij} = O(h^2).
\]
For large steps, $h \gg 0$, the dominant term in the total error of $\nabla_i \nabla_j f(x)$ has the same order as the dominant term of $\nabla_i \nabla_i f(x) = f''_{ii}(x)$. The multipliers on $h^2$ and $1/h^2$ in the expression for the total error change, but the order of the optimal step size is still $h^* \propto \sqrt[4]{\epsilon_{\mathrm{m}}}$.

The exact expression for the truncation term depends on the sum of fourth-order cross-derivatives. The following code calculates which fourth derivatives enter the expression for the truncation error using the same logic (we skip the tedious derivation):

```{r}
mults4 <- getMults(4)
print(mults4[mults4 != 0])
```

This implies that the error term is determined solely by $\frac{\partial^4 f}{\partial x_i^3 \partial x_j}$ and $\frac{\partial^4 f}{\partial x_i \partial x_j^3}$ owing to the symmetry of cross-derivatives: $f''''_{jiii} = f''''_{ijii} = f''''_{iiji} = f''''_{iiij}$. Each of these two derivatives appears in 4 equivalent permutations with the multiplier 4; hence, the exact value of the higher-order coefficient is
\[
O(h^4) = \frac{h^4}{24} (16 f''''_{iiij}(x + c_1) + 16 f''''_{ijjj}(x+c_2)) \approx \frac{2h^4}{3} (f''''_{iiij}(x) + f''''_{ijjj}(x)),
\]
where $c_1, c_2 \in (-h, h)$ are some constants. After dividing by $4 h^2$, we get the truncation error equal to $\frac{h^2}{6} (f''''_{iiij}(x) + f''''_{ijjj}(x))$ that can be plugged into the expression for the total error. The rounding error $[f([x \pm h_i \pm h_j])] - f(x \pm h_i \pm h_j)$ from 4 function evaluations is approximately bounded by $4 \times 0.5 |f(x)| \epsilon_{\mathrm{m}}$.
\[
\varepsilon(h) = \frac{|f''''_{iiij}(x) + f''''_{ijjj}(x)|}{6} h^2 + \frac{0.5 |f(x)| \epsilon_{\mathrm{m}}}{h^2}
\]
\[
\varepsilon'(h) = 0 \quad \Rightarrow \quad \frac{|f''''_{iiij}(x) + f''''_{ijjj}(x)|}{3} h = \frac{|f(x)| \epsilon_{\mathrm{m}}}{h^3} \quad \Rightarrow \quad h = \sqrt[4]{\frac{3 |f(x)| \epsilon_{\mathrm{m}}}{ |f''''_{iiij}(x) + f''''_{ijjj}(x)|}} \propto \sqrt[4]{\epsilon_{\mathrm{m}}}
\]
We recommend that the main diagonal of the Hessian be computed separately using $h^{**}_{\mathrm{CD}_2}$ (or, even better, $h^{**}_{\mathrm{CD}_4}$) estimated for each coordinate, and each off-diagonal element $H_{ij}$ using data-driven procedures described in the section below (e.g. ‘AutoDX’).

\textcolor{red}{!!! TODO: write a paragraph about the general case with different step sizes $h_i$ and $h_j$; mention that it is mostly of theoretical, not practical interest}

## Exploiting the Hessian symmetry

In theory, by the symmetry of second derivatives, if all partial derivatives are differentiable, then, the crossed partials are equal (see the proof of Theorem 3.3.8, ‘Equality of crossed partials’, in @hubbard2015vector).
We recommend that the main diagonal of the Hessian be computed separately using $h^{**}_{\mathrm{CD}_2}$ (or, even better, $h^{**}_{\mathrm{CD}_4}$) estimated for each coordinate, and each off-diagonal element $H_{ij}$ using the data-driven procedures described in the section below (e.g. ‘AutoDX’).

\textcolor{red}{!!! TODO: write a paragraph about the general case with different step sizes $h_i$ and $h_j$; mention that it is mostly of theoretical, not practical interest}

## Exploiting the Hessian symmetry

In theory, by the symmetry of second derivatives, if all partial derivatives are differentiable, then the crossed partials are equal (see the proof of Theorem 3.3.8, ‘Equality of crossed partials’, in @hubbard2015vector). This is known as Schwarz's theorem; the usual requirement that the second-order partials be continuous may be weakened because such continuity implies differentiability of the first-order partials, but not vice versa.

In practice, however, the symmetry of the numerical Hessian depends on how it was computed. If formula \eqref{eq:nhess} is used, then symmetry is guaranteed because the terms of the finite sum are identical: $[H_{ij}] = [H_{ji}]$. This can be exploited to reduce computation time: if $\dim x = p$, calculate only the upper triangle (including the main diagonal) and reflect it, carrying out only $p(p+1)/2$ operations instead of $p^2$. This is the default behaviour of the `Hessian()` function from this package.

If, instead, the implementation carries out repeated Jacobian calculations of the numerical gradient, then there is no guarantee that $\nabla_j [\nabla_i f(x)]$ be equal to $\nabla_i [\nabla_j f(x)]$. This is, e.g., the case with the `optimHess()` function from `stats`: it calls the gradient in a loop over coordinates. To ensure symmetry in this case, the numerical Hessian should be averaged with its transpose: \textcolor{red}{!!! TODO}

This approach is standard in applications where the matrix of interest has to be positive semi-definite, e.g. autocorrelation-consistent estimation of variances due to @newey1987simple.

\textcolor{red}{!!! TODO: faster and less accurate first differences? A tangent plane should be approximable with only 3 points}

If the Hessian is computed via repeated differences of the numerical gradient, the assumption $|[H_{ij}] - [H_{ji}]| \approx 0$ may be used to reduce the number of elements to compute from $p^2$ to $p(p+1)/2$ because only the upper triangle of the matrix of partial derivatives needs to be computed:
\[ H_{ij} \approx \bigl[ J_j([\nabla_i f(x)]) \bigr] \]

\textcolor{red}{!!! TODO: derive the total error with the rounding component from an earlier formula}

From the computational point of view, approximating the Hessian by applying first differences to the coordinates of the numerical gradient is suboptimal in terms of numerical accuracy: the pairs of values $\{ f(x + h_i + h_j), f(x - h_i + h_j)\}$ and $\{ f(x + h_i - h_j), f(x - h_i - h_j)\}$ are collapsed into the single values $[\nabla_i f(x + h_j)]$ and $[\nabla_i f(x - h_j)]$. If the Hessian is evaluated in a time-saving loop over the upper triangle (as it is, e.g., in `numDeriv:::genD.default()` by @numDeriv), then the rounding error from $[\nabla_i f(x + h_j)]$ is copied over to $H_{ji}$. Not collapsing the evaluated values $f(x + h_i + h_j)$ and $f(x - h_i + h_j)$ into the numerical gradient and storing them separately eliminates the problem $[H_{ij}] = \bigl[ J_j([\nabla_i f(x)]) \bigr] \ne \bigl[ J_i([\nabla_j f(x)]) \bigr] = [H_{ji}]$ because the numerators of the non-simplified first differences of the first differences are the same.

The current version of the package contains no implementation of this implicit symmetrisation via direct calculation without omission. However, it offers a choice between the quicker approach (no symmetrisation) and the more accurate one (averaging the full matrix with its transpose).
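The lack of exact symmetry under repeated differencing can be seen in a minimal hand-rolled sketch (the helper functions, the test function, and the inner and outer step sizes below are ad-hoc illustrative choices and not part of the package):

```{r}
f <- function(x) exp(x[1]) * x[2]^3 + sin(x[1] * x[2])
numGrad <- function(f, x, h = 6e-6) {           # central-difference gradient
  sapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, h)
    (f(x + e) - f(x - e)) / (2 * h)
  })
}
numHess <- function(f, x, h = 1e-4) {           # central differences of the numerical gradient
  sapply(seq_along(x), function(j) {
    e <- replace(numeric(length(x)), j, h)
    (numGrad(f, x + e) - numGrad(f, x - e)) / (2 * h)
  })
}
H <- numHess(f, c(0.5, 0.7))
max(abs(H - t(H)))        # small but non-zero: no symmetry guarantee
H.sym <- (H + t(H)) / 2   # symmetrisation by averaging with the transpose
isSymmetric(H.sym)
```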
# Common issues with numerical derivatives

## Stencil choice

Early literature (@jordan1928formule, @aitken1938application), being focused on interpolation techniques, distinguished two cases: equidistant stencils symmetric around zero ($\pm1/2$, $\pm3/2$, $\pm5/2$, \ldots) and equidistant stencils containing zero (0, $\pm1$, $\pm2$, \ldots). The former was more economical because, as we showed previously, two-term central differences are more accurate than one-sided ones, and minimising the number of terms reduced the time spent by *computers* (humans). Nowadays, it is possible to obtain interpolation coefficients for arbitrary stencils.
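For instance, such coefficients can be found by solving the linear system implied by the Taylor expansions on the chosen stencil; the short self-contained helper below (written for this illustration, not a package routine) does exactly that for the first derivative.

```{r}
# Finite-difference weights b on a stencil s such that f'(x) ~ sum(b * f(x + s*h)) / h:
# solve the Taylor-expansion system sum(b * s^k / k!) = 1{k = 1}, k = 0, ..., n-1
fdWeights <- function(stencil, deriv = 1) {
  n <- length(stencil)
  A <- t(outer(stencil, 0:(n-1), "^")) / factorial(0:(n-1))
  solve(A, replace(numeric(n), deriv + 1, 1))
}
w <- fdWeights(c(-2, -1, 1, 2))
w                                                  # 1/12, -2/3, 2/3, -1/12
sum(w * sin(1 + c(-2, -1, 1, 2) * 1e-3)) / 1e-3    # ~ cos(1) with O(h^4) error
fdWeights(c(-3, -1, 1, 3))                         # weights for the 'step-tripling' stencil
```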
In terms of step size, the stencil ($\pm1/2$, $\pm3/2$) implies tripling the distance from zero for the outermost points, whereas the step size on the stencil ($\pm1$, $\pm2$) is only doubled. This distinction raises the question: since both approaches are 4^th^-order accurate, which stencil yields a smaller truncation error? \textcolor{red}{!!! Write better}

Comparing the stencils $(-2, -1, 1, 2)$ and $(-3, -1, 1, 3)$ is not fair because the Taylor remainder is directly proportional to the larger step size: the inner points eliminate the first-order terms and leave the second-order truncation error $O(h^2)$, the outer points eliminate the second-order truncation error and leave the fourth-order remainder $O(h^4)$, and the latter is larger for larger $h$. It is tempting to impose a normalisation on the step size when comparing stencils of the form $(-1-s, -1+s, 1-s, 1+s)$ because in this case, the average step size is one. Setting $s=1/2$ yields ($\pm1/2$, $\pm3/2$) with step tripling, and $s=1/3$ yields ($\pm2/3$, $\pm4/3$) with step doubling. However, this approach is also problematic because larger values of $s$ result in near-zero inner step sizes and, therefore, near-zero higher-order Taylor remainders. A proper analysis of optimal stencils requires considering the total error, including the machine-rounding term that increases for small step sizes.

\textcolor{red}{Write down the system of equations in the matrix form!}

\textcolor{red}{Derive the expression for the coefficient on $f^{(V)}$ for faraway derivatives!}

## Handling very small and very large arguments

\textcolor{red}{!!! FIX: the step size should not be relative...}

Usually, the step size is proportional to the magnitude of the argument, i.e. $h \propto |x| \sqrt[c]{\epsilon_{\mathrm{m}}}$, where $c$ depends on the derivative and accuracy order (see above). However, for $x\approx 0$, this may result in $[[x]+[h]] = [x]$. To avoid this, the default behaviour in many software implementations, including `numDeriv`, is to use a fixed $h$ for $|x| < \delta$. E.g. `numDeriv` uses `h = eps = 1e-4` in place of `h = d*x` if `abs(x) < zero.tol`; the default method arguments are `eps = d = 1e-4` and `zero.tol = sqrt(.Machine$double.eps/7e-7) = 1.781e-5`.

For iterative step-size search procedures, @stepleman1979adaptive suggest a reasonable workaround for near-zero arguments: use $h = 0.04\sqrt[c]{\epsilon_{\mathrm{m}}}$ as the starting value of the step search. In practice, one may take any reasonable value suitable for the application and supply *a priori* information about the function and about the meaningful distinction between zero and non-zero values. In certain applications, only positive parameter values are allowed, which makes zero the *boundary of the parameter space* -- in such cases, particular solutions such as ‘take the *relative* step size $10^{-4}$’ are more reliable than ‘take the absolute step size $10^{-5}$’.

**Example.** In GARCH models, the constant $\omega$ in the conditional-variance equation $\sigma^2_t = \omega + \alpha \varepsilon^2_{t-1} + \beta \sigma^2_{t-1}$ is usually very small (often in the range $[10^{-7}, 10^{-3}]$). If $\omega = 10^{-6}$, then using the absolute step size $h = 10^{-5}$ (reasonable in most applications) would result in wildly inaccurate numerical derivatives or even invalid variance values: by definition, $\sigma^2_t > 0$, but the perturbed value $\omega - h = -9\cdot10^{-6}$ is negative and may create negative values in the generated series of $\sigma^2_t$, which is meaningless in statistics. In the best case, the estimated numerical differences will be equal to `NA`. In the worst case, the function computing the log-likelihood of such a model will throw an error.
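The following toy snippet (a deliberately minimal stand-in for a GARCH log-likelihood term, using `log(omega)` purely for illustration) shows how an absolute step larger than the parameter breaks the central difference, whereas a relative step does not:

```{r}
f     <- function(omega) log(omega)   # stand-in for a likelihood term defined only for omega > 0
omega <- 1e-6
h.abs <- 1e-5                         # 'reasonable' absolute step, but larger than omega
(f(omega + h.abs) - f(omega - h.abs)) / (2 * h.abs)   # NaN: log of a negative number
h.rel <- omega * 1e-4                 # step relative to the argument
(f(omega + h.rel) - f(omega - h.rel)) / (2 * h.rel)   # close to 1/omega = 1e6
```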
It is not uncommon to use $(0, \ldots, 0)$ as the starting value in numerical optimisation; in this case, knowledge about the expected reaction of $f(x)$ to changes in each coordinate of $x$ is necessary to avoid disproportionately large truncation or rounding errors. Therefore, if possible, the researcher should either take into account the curvature of the function with respect to the arguments that are too small or use heuristic search procedures to determine the optimal step size.

## Derivatives of noisy functions \label{sec:noisy}

The very definition of a derivative involves differentiability, which sounds tautological, yet this assumption is often violated in applied research. In fact, the computer representations $[x+h]$ and $\hat f([x+h])$ are discontinuous due to the finite machine precision (the number of bits in the mantissa). The machine epsilon is such a number that $[1 + \delta] = 1$ for $|\delta| \le \epsilon_{\mathrm{m}}/2$; therefore, if $x = [x]$, then $f([x \pm \delta x]) = f(x)$ for $|\delta| \le \epsilon_{\mathrm{m}}/2$. Hence, the core problem of numerical differentiation is working with discontinuous functions for which $\lim_{h\to 0} \frac{\hat f([x+h]) - \hat f([x])}{h} = 0$, whereas the derivative of the noiseless function $f$ -- as if the machine had infinite precision -- has to be computed with the maximum achievable precision.

So far, we have considered the ‘best’ case, assuming that $|\hat f([x+h]) - f(x+h)| \le 0.5 |f(x+h)| \epsilon_{\mathrm{m}}$. However, it is quite common to have objective functions that depend on the values of inaccurate internal numerical routines. Usually, such routines involve optimisation, integration, root search, or the computation of derivatives. Depending on the convergence tolerance (and the ensuing error magnitude) of the inner routines, small changes in the input parameter of the outermost objective function might lead to abrupt changes in the return values of the inner functions, which implies discontinuity and, therefore, non-differentiability. More often than not, the total error of internal routines is outside the user's control.

Even such innocent actions as swapping two mathematical operations may lead to numerical errors. Example: if the internal loop of empirical-likelihood maximisation involves replacing logarithms with their 4^th^-order Taylor expansion for small inputs, then the order of multiplication/division in computing the Taylor approximation affects the result! Consider computing $\frac14 (x-t)^4 / t^4$ and getting two different results:

```{r}
x <- -0.2456605107847454  # 16 significant digits
t <- 1/59
print(res1 <- (x-t)^4 * (t^-4 / 4), 17)
print(res2 <- (x-t)^4 / (t^4 * 4), 17)
print(res1 - res2)
```

Two mathematically equivalent expressions, $(x-t)^4 \cdot (t^{-4} / 4)$ and $(x-t)^4 / (4t^4)$, differ in absolute terms by more than $1000 \epsilon_{\mathrm{m}}$! Evaluating $(t^{-4} / 4) - 1/(4t^4)$ yields exactly zero; however, multiplying $(x-t)^4$ by the former quantity and dividing it by $4t^4$ involve differently rounded intermediate operations, and accuracy bits are lost. The magnitude of the objective function in this specific application is approximately 1, which is why the optimisation algorithm terminates after a different number of steps under the two implementations of the Taylor series. The consequence of this loss is an increase in the rounding-error variance and the need to take the absolute magnitude of the error $\hat f([x+h]) - f(x+h)$ into account; the relative error is still bounded by $\epsilon_{\mathrm{m}}/2$, but when some intermediate terms of the computation shoot off due to division by small quantities (in this example, $4t^4 \approx 3.3\cdot 10^{-7}$), the magnitude of the numerator changes.

Usually, it is not feasible to pinpoint the most ill-conditioned operation in the chain of thousands or millions of operations taking place under the hood of present-day objective functions, but a rough measure of the ‘combined noisiness due to inaccurate internal routines’ could be calculated based on certain characteristics of the input.

\textcolor{red}{!!! Comment: no straightforward correspondence between `rel.tol` and `xtol`}

If computing $f$ involves optimisation with a relative-tolerance-based stopping criterion `rel.tol = 1e-8`, then the difference might be as bad as $\epsilon_{\mathrm{m}} \cdot 10^8$! The same argument applies to numerical integration, root solving, stochastic algorithms, and the like: if $\hat f(x + h)$ is prone to deviating from $f(x+h)$ due to extra losses occurring under the hood of $f$, the bound on the rounding error must reflect this fact. It would be dangerous to assume $|\varepsilon_{f,2} - \varepsilon_{f,1}| \le \epsilon_{\mathrm{m}}$ if changes of the inputs cause the internal routines of $f$ to run differently due to the butterfly effect. If $f$ relies on stochastic optimisation with relative tolerance $2.22 \cdot 10^{-7}$ (not uncommon in popular algorithms) as the stopping criterion, then $|\varepsilon_{f,2} - \varepsilon_{f,1}| \le 10^9 \cdot \epsilon_{\mathrm{m}}$, which is why the rule-of-thumb $h^*$ needs to be blown up by a factor of $\sqrt[3]{10^9} = 1000$ (becoming 0.006) to obtain any meaningful information about the slope of the function! Ironically, this is the reason why repeated differences for higher-order derivatives can be so unstable: central differences of well-behaved functions lose 5 accurate digits, and one-sided differences lose 8 (out of 16!) digits.
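To visualise this effect, here is a stylised experiment in which the ‘inner-routine noise’ is emulated by a small oscillating term of magnitude $2.2\cdot 10^{-7}$ added to a smooth function (the function, the shape of the noise, and the grid of steps are all arbitrary assumptions made for this sketch). The noiseless rule-of-thumb step $\sqrt[3]{\epsilon_{\mathrm{m}}}$ (left dashed line) lands in the noise-dominated region, whereas the inflated step of roughly 0.006 (right dashed line) is close to the empirical optimum.

```{r}
fhat <- function(x) exp(x) + 2.2e-7 * sin(1e9 * x)   # smooth function + emulated evaluation noise
x0  <- 1
hs  <- 10^seq(-8, -1, 0.25)
err <- sapply(hs, function(h) abs((fhat(x0 + h) - fhat(x0 - h)) / (2*h) - exp(x0)))
plot(hs, err, log = "xy", type = "l", bty = "n",
     xlab = "Step size h", ylab = "Absolute error of the central difference")
abline(v = c(.Machine$double.eps^(1/3), 0.006), lty = 2)
```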
\textcolor{red}{!!! The part with the numerical gradient here}

\textcolor{red}{!!! Parallels: numerical integration? Root solving? Nested optimisation?}

The problem with function optimisation is that the relative tolerance as a stopping criterion is \textcolor{red}{TODO}

\textcolor{red}{!!! There is an application where the evaluation error is available for free -- numerical integration. Gauss--Kronrod, G7--K15.}

\textcolor{red}{!!! Examples of numerical integration from our department: Suarez, Schumann, Tripathi; Kostyrka, Malakhov. Many nuisance parameters such as fixed effects that are integrated out or local likelihoods that are maximised numerically.}

\textcolor{red}{!!! Write a simulation with a two-level function: sum of expressions with analytical solutions, but optimised numerically to gauge the error impact}

## Accuracy loss due to repeated differencing

\textcolor{red}{!!! TODO: Case 1: second derivatives}

\textcolor{red}{!!! TODO: Case 2: Hessian}

So far, we have considered only the ‘ideal’ case where the diagonal elements of a Hessian are computed via the proper formula for second derivatives and the off-diagonal elements are computed in one single step. This approach, while being the simplest to analyse, is cumbersome to implement. It is much more common to compute the Jacobian of the gradient (e.g. the popular implementation `stats::optimHess()` follows this approach). This might incur extra accuracy loss because the numerator in this case contains differences of ‘lossy’ numerical derivatives, not of the original function values. The effect of this error compounding can be illustrated by the following example. Recall that two-sided differences are accurate only up to (approximately) 2/3 of their significant digits: indeed, $|\hat f([x+h]) - f(x+h)| \le 0.5 |f(x+h)| \epsilon_{\mathrm{m}}$, but $|\hat f'_{\mathrm{CD}_2}([x+h]) - f'(x+h)| = O(h^2) = O(\epsilon_{\mathrm{m}}^{2/3})$ for the optimal $h \propto \sqrt[3]{\epsilon_{\mathrm{m}}}$. The noisy $\hat f'_{\mathrm{CD}_2} (x+h) = f'(x+h) + O(\epsilon_{\mathrm{m}}^{2/3})$ enters the numerator of the second step and is multiplied by the $1/h$ factor again. Therefore, rounding errors accumulate in the two-step differencing procedure to a greater extent than in the one-step one, and a dedicated analysis is required to gauge the order of magnitude of the accuracy loss. \textcolor{red}{!!! TODO}
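A rough empirical comparison (a sketch with textbook rule-of-thumb steps and an arbitrary test function, not an implementation of any package routine) gives a feel for the magnitude of this loss: the second derivative obtained by differencing a noisy numerical first derivative is typically noticeably less accurate -- often by one to two orders of magnitude -- than the one computed directly from three function values.

```{r}
f  <- function(x) exp(x)
x0 <- 1
h2 <- .Machine$double.eps^0.25                        # ~1.2e-4: step for the direct second difference
d2.direct <- (f(x0 + h2) - 2*f(x0) + f(x0 - h2)) / h2^2
h1 <- .Machine$double.eps^(1/3)                       # ~6.1e-6: step for the first differences
g  <- function(x) (f(x + h1) - f(x - h1)) / (2 * h1)  # 'lossy' numerical first derivative
d2.repeated <- (g(x0 + h1) - g(x0 - h1)) / (2 * h1)
c(direct = d2.direct - exp(x0), repeated = d2.repeated - exp(x0))
```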
# Complex derivatives

If the function $f(x)$ supports complex arguments and outputs, as is the case with many simple arithmetic and algebraic functions, then evaluating it at complex arguments allows one to obtain highly accurate numerical derivatives with fewer function evaluations. The Taylor expansion for a function of a complex variable is
\[
f(x + \mathrm{i}h) = f(x) + \mathrm{i}h f'(x) - \frac{h^2}{2} f''(x) - \mathrm{i} \frac{h^3}{3!} f'''(x) + \frac{h^4}{4!} f''''(x) + O(h^5).
\]
The imaginary part of this expression divided by $h$ is
\[
\frac{\Im f(x + \mathrm{i}h)}{h} = f'(x) - \frac{h^2}{3!} f'''(x) + O(h^4).
\]
One evaluation and one division yield an approximation whose largest error term is of the order $O(h^2)$, and the overall accuracy does not suffer from catastrophic cancellation. The total approximation error of this method stays small even for very small step sizes $h$. See @squire1998using for numerical examples.

The choice of the optimal step size is not obvious in the general case because the user should choose not only $h$ but also the direction (the power of $\mathrm{i}$). @martins2003complex give the following recommendation: require that $h^2 |f'''(x)/3!| < \varepsilon |f'(x)|$, where $\varepsilon$ is the relative algorithm precision, for which $h$ has to be small enough. If $f(x) \approx 0$ or $f'(x) \approx 0$, this condition might not hold. However, a small complex step $h = 10^{-20}$ is a reasonable rule-of-thumb value.
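A quick check in base R (with an arbitrary test function chosen for this sketch) illustrates the accuracy of the complex-step approximation even with the tiny step $h = 10^{-20}$:

```{r}
f  <- function(x) exp(x) / sqrt(x^2 + 1)              # accepts complex arguments
x0 <- 1.5
h  <- 1e-20                                           # far below any usable real-valued step
d.cs <- Im(f(x0 + 1i * h)) / h                        # complex-step approximation
d.an <- exp(x0) * (x0^2 - x0 + 1) / (x0^2 + 1)^1.5    # analytical derivative
c(complex.step = d.cs, analytical = d.an, difference = d.cs - d.an)
```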
Higher-order differences require fewer evaluations, too, but are not difference-free, which is why
\[
f''(x) = \frac{2[f(x) - \Re f(x+\mathrm{i}h)]}{h^2} - \frac{h^2}{12} f''''(x) + O(h^4)
\]
suffers from machine round-off to a certain degree. The method's accuracy and convergence can be improved by carefully choosing the complex-step *angle*:
\[
f'(x) = \frac{\Im\{ f(x+\mathrm{i}^{2/3}h) - f(x+\mathrm{i}^{8/3}h)\} }{\sqrt{3} h} - \frac{h^4}{120} f^{(V)}(x) + \ldots
\]
\[
f''(x) = \frac{\Im\{ f(x+\mathrm{i}^{1/2}h) + f(x+\mathrm{i}^{5/2}h)\} }{h^2} - \frac{h^4}{360} f^{(VI)}(x) + \ldots
\]
Jacobians and Hessians can be evaluated similarly with fewer function calls. However, for small step sizes, these formulæ suffer from round-off error just like their real-valued finite-difference counterparts; the benefit is that the complex method yields accurate matrix estimates over a much wider range of step sizes. Richardson extrapolation can be used to attain higher-order accuracy at the cost of more evaluations. Using 4 instead of 2 complex argument evaluations yields ludicrous accuracy:
\[
f''(x) = \frac{\Im \left\{ 64 [f(x+\frac{\sqrt{\mathrm{i}} h}{2}) + f(x-\frac{\sqrt{\mathrm{i}} h}{2})] - f(x+\sqrt{\mathrm{i}} h) - f(x-\sqrt{\mathrm{i}} h) \right\}}{15 h^2} + \frac{h^8 f^{(X)}(x)}{1\,814\,400},
\]
i.e. the truncation term is roughly half a millionth of the tenth derivative times the eighth power of the tiny step! See @lai2008extensions for more details.

Note that this approach is not feasible in many applications where the function is not defined for complex inputs. In econometrics, densities are non-negative, probabilities are bounded, and the function in question is often the objective function of a non-linear model in which many more numerical procedures are run under the hood (loops, numerical root searches, integration routines, data-driven hyper-parameter cross-validation etc.) that do not allow any complex inputs. Currently, the complex-step method is not supported in the `pnd` package; the derivations and an implementation may be added in the future to fully restore the functionality of `numDeriv`.

# TODO -- Uncategorised

\textcolor{red}{!!! Show that the names are lost with grad and jacobian}

\textcolor{red}{!!! Compare the error of the Richardson extrapolation from a larger step with a stencil with a smaller step}

\textcolor{red}{!!! Compare numDeriv with Richardson and pnd with a stencil}

The results are comparable in terms of effective accuracy: e.g. the grid $x-2h, x-h, x+h, x+2h$ from `pnd` corresponds to 4 Richardson iterations from `numDeriv`, with minor differences due to the initial step size choice.

\textcolor{red}{!!! Benchmark optimising SEL with two-sided and one-sided gradients and switching at the end, compare speed and accuracy}

Simulation set-up from Curtis and Reid:

```{r}
f1    <- function(x) expm1(x)^2 + (1/sqrt(1+x^2) - 1)^2
df1   <- function(x) 2 * exp(x) * expm1(x) - 2*x * (1/sqrt(1+x^2) - 1) / (1 + x^2) / sqrt(1 + x^2)
ddf1  <- function(x) (6 * (1/sqrt(1 + x^2) - 1) * x^2)/(1 + x^2)^(5/2) + (2 * x^2)/(1 + x^2)^3 -
  (2 * (1/sqrt(1 + x^2) - 1))/(1 + x^2)^(3/2) + 2 * exp(2*x) + 2 * exp(x) * expm1(x)
dddf1 <- function(x) (18 * (1/sqrt(1 + x^2) - 1) * x)/(1 + x^2)^(5/2) + (6 * x)/(1 + x^2)^3 -
  (30 * (1/sqrt(1 + x^2) - 1) * x^3)/(1 + x^2)^(7/2) - (18 * x^3)/(1 + x^2)^4 +
  6 * exp(2*x) + 2 * exp(x) * expm1(x)
squish   <- function(x, pow = 1/2, shift = 1) ((abs(x) + shift)^pow - shift^pow) * sign(x)
unsquish <- function(y, pow = 1/2, shift = 1) ((abs(y) + shift^pow)^(1/pow) - shift) * sign(y)
xgrid <- seq(-0.2, 1.5, 0.01)
par(mar = c(2, 2, 0, 0) + .1)
plot(xgrid, squish(f1(xgrid)), type = "l", ylim = squish(c(-1, 20)), bty = "n", lwd = 2,
     xlab = "", ylab = "", yaxt = "n")
lines(xgrid, squish(df1(xgrid)), col = 2)
lines(xgrid, squish(ddf1(xgrid)), col = 3)
lines(xgrid, squish(dddf1(xgrid)), col = 4)
axis(2, squish(ats <- c(-1, 0, 1, 2, 5, 10, 20)), labels = ats, las = 1)
```

```{r eval=FALSE, include=FALSE}
f1 <- function(x) x^4
try1.good  <- step.SW(x = 1, f1)  # Starts at h0 = 1e-5
try1.small <- step.SW(x = 1, f1, h0 = 1e-9)
try1.large <- step.SW(x = 1, f1, h0 = 10)
try1.huge  <- step.SW(x = 1, f1, h0 = 1000)
try1   <- list(try1.good, try1.small, try1.large, try1.huge)
hvals1 <- sapply(try1, function(x) x$iterations$h[1])
for (i in 1:4) {
  cat("\n\nDiagnostics for the SW79 algorithm, f(x) = x^4, x = 1, h0 =", hvals1[i], "\n")
  cairo_pdf(paste0("power-", i, ".pdf"), 6.2, 4)
  tryCatch(par(family = "Fira Sans"), error = \(e) NULL, warning = \(w) NULL)
  printDiag(try1[[i]], true.val = 4, main = paste0("f(x) = x^4 with initial h = ", hvals1[i]))
  dev.off()
}

f2 <- sin
try2.good  <- step.SW(x = pi/4, f2)  # Starts at h0 = 1e-5
try2.small <- step.SW(x = pi/4, f2, h0 = 1e-9)
try2.large <- step.SW(x = pi/4, f2, h0 = 10)
try2.huge  <- step.SW(x = pi/4, f2, h0 = 1000)  # Observe the custom warning
try2   <- list(try2.good, try2.small, try2.large, try2.huge)
hvals2 <- sapply(try2, function(x) x$iterations$h[1])
for (i in 1:4) {
  cat("\n\nDiagnostics for the SW79 algorithm, f(x) = sin(x), x = pi/4, h0 =", hvals2[i], "\n")
  cairo_pdf(paste0("sine-", i, ".pdf"), 6.2, 4)
  tryCatch(par(family = "Fira Sans"), error = \(e) NULL, warning = \(w) NULL)
  printDiag(try2[[i]], true.val = sqrt(2)/2, main = paste0("f(x) = sin(x) with initial h = ", hvals2[i]))
  dev.off()
}
```

\textcolor{red}{!!! TO BE COMPLETED!}

# References