---
title: "Augmented Linear Model"
author: "Ivan Svetunkov"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Augmented Linear Model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: library.bib
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align="center",
  fig.height=4,
  fig.width=6
)

library(greybox)
```

ALM stands for "Augmented Linear Model". The word "augmented" is used to reflect that the model introduces aspects that extend beyond the basic linear model. In some special cases `alm()` resembles the `glm()` function from the `stats` package, but with a higher focus on forecasting rather than on hypothesis testing. You will not get p-values anywhere from the `alm()` function and won't see $R^2$ in the outputs. The most you can count on is having confidence intervals for the parameters or for the regression line. The other important difference from `glm()` is the availability of distributions that are not supported by `glm()` (for example, Folded Normal or Box-Cox Normal distributions) and the ability to optimise non-standard parameters (e.g. $\lambda$ in Asymmetric Laplace distribution). Finally, `alm()` supports different loss functions via the `loss` parameter, so you can estimate the parameters of your model via, for example, likelihood maximisation, via minimisation of MSE / MAE, using LASSO / RIDGE or by minimising a loss function provided by the user.

Although `alm()` supports various loss functions, the core of the function is the likelihood approach. By default the estimation of parameters in the model is done via the maximisation of the likelihood function of a selected distribution. The calculation of the standard errors is based on the hessian of the distribution. And in the centre of all of that are information criteria that can be used for model comparison.

This vignette contains the following sections:

- [Supported distributions](#distributions);
- [Scale model](#scaleModel);
- [Using different loss functions](#losses);
- [Estimation process](#estimator).

# Supported distributions {#distributions}

All the supported distributions have specific functions which form the following six groups for the `distribution` parameter in `alm()`:

1. [Density functions of continuous distributions](#continuous),
2. [Density functions for continuous non-negative data](#continuousNonNegative),
3. [Density functions for continuous positive data](#continuousPlus),
4. [Continuous distributions on a specific interval](#continuousInterval),
5. [Density functions of discrete distributions](#discrete),
6. [Cumulative functions for binary variables](#cumulative).

All of them rely on the respective d- and p- functions in R. For example, Log-Normal distribution uses the `dlnorm()` function from the `stats` package.

The `alm()` function also supports the `occurrence` parameter, which allows modelling the non-zero values and the occurrence of non-zeroes as two different models. The combination of any distribution from (1) - (3) for the non-zero values and a distribution from (6) for the occurrence will result in a [mixture distribution model](#mixture), e.g. a mixture of Log-Normal and Cumulative Logistic or a Hurdle Poisson (with Cumulative Normal for the occurrence part).
Every model produced using `alm()` can be represented as:
\begin{equation} \label{eq:basicALM}
y_t = f(\mu_t, \epsilon_t) = f(x_t' B, \epsilon_t) ,
\end{equation}
where $y_t$ is the value of the response variable, $x_t$ is the vector of exogenous variables, $B$ is the vector of the parameters, $\mu_t$ is the conditional mean (produced based on the exogenous variables and the parameters of the model), $\epsilon_t$ is the error term on the observation $t$ and $f(\cdot)$ is the distribution function that does a transformation of the inputs into the output.

In case of a mixture distribution the model becomes slightly more complicated:
\begin{equation} \label{eq:basicALMMixture}
\begin{matrix}
y_t = o_t f(x_t' B, \epsilon_t) \\
o_t \sim \mathrm{Bernoulli}(p_t) \\
p_t = g(z_t' A)
\end{matrix},
\end{equation}
where $o_t$ is the binary variable, $p_t$ is the probability of occurrence, $z_t$ is the vector of exogenous variables and $A$ is the vector of parameters for the $p_t$.

In addition, the function supports a scale model, i.e. a model that predicts the values of the scale of distribution (for example, the variance in case of the normal distribution) based on the provided explanatory variables. This is discussed in some detail in [a separate section](#scaleModel).

The `alm()` function returns, along with the set of variables common for `lm()` (such as `coefficients` and `fitted.values`), the variable `mu`, which corresponds to the conditional mean used inside the distribution, and `scale` -- the second parameter, which usually corresponds to the standard error or the dispersion parameter. The values of these two variables vary from distribution to distribution. Note, however, that the `model` variable returned by the `lm()` function was renamed into `data` in `alm()`, and that `alm()` does not return `terms` and the QR decomposition.

Given that the parameters of any model in `alm()` are estimated via likelihood, it can be assumed that they have an asymptotically normal distribution, thus the confidence intervals for any model rely on the normality and are constructed based on the unbiased estimate of variance, extracted using the `sigma()` function. The covariance matrix of parameters in almost all the cases is calculated as the inverse of the hessian of the respective distribution function. The exceptions are the Normal, Log-Normal, Poisson, Cumulative Logistic and Cumulative Normal distributions, which use analytical solutions.

The `alm()` function also supports factors in the explanatory variables, creating a set of dummy variables from them. In case of ordered variables (ordinal scale, `is.ordered()`), the ordering is removed and the set of dummy variables is produced. This is done in order to avoid the built-in behaviour of R, which creates linear, quadratic, cubic etc. contrasts for ordered variables, which makes the interpretation of the parameters difficult.

When the number of estimated parameters is calculated, in case of `loss="likelihood"` the scale is considered as one of the parameters as well, which aligns with the idea of the maximum likelihood estimation. For all the other losses, the scale does not count (this aligns, for example, with how the number of parameters is calculated in OLS, which corresponds to `loss="MSE"`).

Although the basic principles of estimation of models and predictions from them are the same for all the distributions, each of the distributions has its own features. So it makes sense to discuss them individually. We discuss the distributions in the six groups mentioned above.
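Before moving to the individual distributions, here is a small sketch of how the `occurrence` mechanism discussed above can be used in practice. The chunk is not evaluated in this vignette; the data (`mixtureData`) and the model (`mixtureModel`) are simulated purely for illustration, and it is assumed that `occurrence` accepts the name of the cumulative distribution function (e.g. `"plogis"`):

```{r mixtureSketch, eval=FALSE}
# Simulate data with zeroes in the response variable
set.seed(21)
mixtureData <- data.frame(x1=rnorm(200,10,2))
mixtureData$y <- exp(1+0.2*mixtureData$x1+rnorm(200,0,0.3)) *
    rbinom(200, 1, plogis(-2+0.25*mixtureData$x1))
# Log-Normal for the non-zero values, Cumulative Logistic for the occurrence
mixtureModel <- alm(y~x1, mixtureData, distribution="dlnorm", occurrence="plogis")
summary(mixtureModel)
```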
## Density functions of continuous distributions {#continuous}

This group of functions includes:

1. [Normal distribution](#normal),
2. [Laplace distribution](#laplace),
3. [Asymmetric Laplace distribution](#alaplace),
4. [Generalised Normal distribution](#gnormal),
5. [Logistic distribution](#logis),
6. [S distribution](#svetunkov),
7. [Student t distribution](#student).

For all the functions in this category the `resid()` method returns $e_t = y_t - \mu_t$.

### Normal distribution {#normal}

The density of normal distribution $\mathcal{N}(\mu_t,\sigma)$ is:
\begin{equation} \label{eq:Normal}
f(y_t) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(y_t - \mu_t \right)^2}{2 \sigma^2} \right) ,
\end{equation}
where $\sigma$ is the standard deviation of the error term. This PDF has a very well-known bell shape:

```{r pdfNormal, echo=FALSE}
plot(seq(-5,5,0.1),dnorm(seq(-5,5,0.1)),type="l",
     xlab="y_t",ylab="Density",main="PDF of Normal distribution")
lines(seq(-5,5,0.1),dnorm(seq(-5,5,0.1),0,2), col="blue")
lines(seq(-5,5,0.1),dnorm(seq(-5,5,0.1),1,1), col="red")
legend("topright",legend=c("N(0,1)","N(0,2)","N(1,1)"),
       lwd=1, col=c("black","blue","red"))
```

`alm()` with Normal distribution (`distribution="dnorm"`) is equivalent to the `lm()` function from the `stats` package and returns roughly the same estimates of parameters, so if you are concerned with the time of calculation, I would recommend reverting to `lm()`.

Maximising the likelihood of the model \eqref{eq:Normal} is equivalent to the estimation of the basic linear regression using the Least Squares method:
\begin{equation} \label{eq:linearModel}
y_t = \mu_t + \epsilon_t = x_t' B + \epsilon_t,
\end{equation}
where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$.

The variance $\sigma^2$ is estimated in `alm()` based on likelihood:
\begin{equation} \label{eq:sigmaNormal}
\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(y_t - \mu_t \right)^2 ,
\end{equation}
where $T$ is the sample size. Its square root (standard deviation) is used in the calculations of the `dnorm()` function, and the value is then returned via the `scale` variable. This value does not have the bias correction. However, the `sigma()` method applied to the resulting model returns the bias-corrected version of the standard deviation. And `vcov()`, `confint()`, `summary()` and `predict()` rely on the value extracted by `sigma()`.

$\mu_t$ is returned as is in the `mu` variable, and the fitted values are set equivalent to `mu`.

In order to produce confidence intervals for the mean (`predict(model, newdata, interval="confidence")`) the conditional variance of the model is calculated using:
\begin{equation} \label{eq:varianceNormalForCI}
V({\mu_t}) = x_t V(B) x_t',
\end{equation}
where $V(B)$ is the covariance matrix of the parameters returned by the function `vcov`. This variance is then used for the construction of the confidence intervals of a necessary level $\alpha$ using the distribution of Student:
\begin{equation} \label{eq:intervalsNormal}
y_t \in \left(\mu_t \pm \tau_{df,\frac{1+\alpha}{2}} \sqrt{V(\mu_t)} \right),
\end{equation}
where $\tau_{df,\frac{1+\alpha}{2}}$ is the upper ${\frac{1+\alpha}{2}}$-th quantile of the Student's distribution with $df$ degrees of freedom (e.g. with $\alpha=0.95$ it will be the 0.975-th quantile, which, for example, for 100 degrees of freedom will be $\approx 1.984$).
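To illustrate these formulae, here is a small sketch that reproduces such a confidence interval manually on simulated data, using `coef()` and `vcov()` of the estimated model. The data (`ciData`), the model (`ciModel`) and the new observation (`xNew`) are created purely for this illustration, and the degrees of freedom are taken as $T-k$ with the scale counted as one of the parameters (an assumption of this sketch):

```{r manualCISketch}
# Simulate a small dataset and estimate the model with Normal distribution
set.seed(7)
ciData <- data.frame(x1=rnorm(100,10,3), x2=rnorm(100,50,5))
ciData$y <- 100 + 2*ciData$x1 - 0.5*ciData$x2 + rnorm(100,0,3)
ciModel <- alm(y~x1+x2, ciData, distribution="dnorm")
# A new observation (intercept, x1, x2), for which the interval is needed
xNew <- c(1, 12, 48)
muNew <- as.numeric(xNew %*% coef(ciModel))
# V(mu_t) = x_t V(B) x_t'
muVariance <- as.numeric(xNew %*% vcov(ciModel) %*% xNew)
# 95% interval based on Student's distribution with T-k degrees of freedom
df <- nrow(ciData) - length(coef(ciModel)) - 1
muNew + c(-1,1) * qt(0.975, df) * sqrt(muVariance)
```

The values obtained this way should be close to what `predict(ciModel, newdata, interval="confidence")` would return for the same observation.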
Similarly for the prediction intervals (`predict(model, newdata, interval="prediction")`) the conditional variance of the $y_t$ is calculated:
\begin{equation} \label{eq:varianceNormalForPI}
V(y_t) = V(\mu_t) + s^2 ,
\end{equation}
where $s^2$ is the bias-corrected variance of the error term, calculated using:
\begin{equation} \label{eq:varianceNormalUnbiased}
s^2 = \frac{1}{T-k} \sum_{t=1}^T \left(y_t - \mu_t \right)^2 ,
\end{equation}
where $k$ is the number of estimated parameters (including the variance itself). This value is then used for the construction of the prediction intervals of a specified level, also using the distribution of Student, in a similar manner as with the confidence intervals.

### Laplace distribution {#laplace}

Laplace distribution has some similarities with the Normal one:
\begin{equation} \label{eq:Laplace}
f(y_t) = \frac{1}{2 s} \exp \left( -\frac{\left| y_t - \mu_t \right|}{s} \right) ,
\end{equation}
where $s$ is the scale parameter, which, when estimated using likelihood, is equal to the mean absolute error:
\begin{equation} \label{eq:bLaplace}
\hat{s} = \frac{1}{T} \sum_{t=1}^T \left| y_t - \mu_t \right| .
\end{equation}
So maximising the likelihood \eqref{eq:Laplace} is equivalent to estimating the linear regression \eqref{eq:linearModel} via the minimisation of $s$ \eqref{eq:bLaplace}. Thus, when estimating a model via minimising $s$, the assumption imposed on the error term is $\epsilon_t \sim \mathcal{Laplace}(0, s)$. The main difference of Laplace from the Normal distribution is its fatter tails; the PDF has the following shape:

```{r pdfLaplace, echo=FALSE}
plot(seq(-5,5,0.1),dlaplace(seq(-5,5,0.1)),type="l",
     xlab="y_t",ylab="Density",main="PDF of Laplace distribution")
lines(seq(-5,5,0.1),dlaplace(seq(-5,5,0.1),0,2), col="blue")
lines(seq(-5,5,0.1),dlaplace(seq(-5,5,0.1),1,1), col="red")
legend("topright",legend=c("Laplace(0,1)","Laplace(0,2)","Laplace(1,1)"),
       lwd=1, col=c("black","blue","red"))
```

The `alm()` function with `distribution="dlaplace"` returns `mu` equal to $\mu_t$ and the fitted values equal to `mu`. $s$ is returned in the `scale` variable. The prediction intervals are derived from the quantiles of Laplace distribution after transforming the conditional variance into the conditional scale parameter $s$ using the connection between the two in Laplace distribution:
\begin{equation} \label{eq:bLaplaceAndSigma}
s = \sqrt{\frac{\sigma^2}{2}},
\end{equation}
where $\sigma^2$ is substituted either by the conditional variance of $\mu_t$ or $y_t$. The kurtosis of Laplace distribution is 6, making it suitable for modelling rarely occurring events.

### Asymmetric Laplace distribution {#alaplace}

Asymmetric Laplace distribution can be considered as two Laplace distributions with different parameters $s$ for the left and right sides. There are several ways to summarise the probability density function; the one used in `alm()` relies on the asymmetry parameter $\alpha$ [@Yu2005]:
\begin{equation} \label{eq:ALaplace}
f(y_t) = \frac{\alpha (1- \alpha)}{s} \exp \left( -\frac{y_t - \mu_t}{s} (\alpha - I(y_t \leq \mu_t)) \right) ,
\end{equation}
where $s$ is the scale parameter, $\alpha$ is the skewness parameter and $I(y_t \leq \mu_t)$ is the indicator function, which is equal to one, when the condition is satisfied, and to zero otherwise. The scale parameter $s$ estimated using likelihood is equal to the quantile loss:
\begin{equation} \label{eq:bALaplace}
\hat{s} = \frac{1}{T} \sum_{t=1}^T \left(y_t - \mu_t \right)(\alpha - I(y_t \leq \mu_t)) .
\end{equation}
Thus maximising the likelihood \eqref{eq:ALaplace} is equivalent to estimating the linear regression \eqref{eq:linearModel} via the minimisation of the $\alpha$ quantile loss, making this equivalent to quantile regression. So quantile regression models assume indirectly that the error term is $\epsilon_t \sim \mathcal{ALaplace}(0, s, \alpha)$ [@Geraci2007]. The advantage of using `alm()` in this case is in having the full distribution, which allows doing all the fancy things you can do when you have a likelihood.

Graphically, the PDF of asymmetric Laplace is:

```{r pdfALaplace, echo=FALSE}
plot(seq(-5,5,0.1),dalaplace(seq(-5,5,0.1),0,0.5),type="l",
     xlab="y_t",ylab="Density",main="PDF of Asymmetric Laplace distribution")
lines(seq(-5,5,0.1),dalaplace(seq(-5,5,0.1),0,1,0.25), col="blue")
lines(seq(-5,5,0.1),dalaplace(seq(-5,5,0.1),0,1,0.75), col="red")
legend("topright",legend=c("ALaplace(0,0.5,0.5)","ALaplace(0,1,0.25)","ALaplace(0,1,0.75)"),
       lwd=1, col=c("black","blue","red"))
```

In case of $\alpha=0.5$ the function reverts to the symmetric Laplace where $s=\frac{1}{2}\text{MAE}$.

The `alm()` function with `distribution="dalaplace"` accepts an additional parameter `alpha` in ellipsis, which defines the quantile $\alpha$. If it is not provided, then the function will estimate it by maximising the likelihood and return it as the first coefficient. `alm()` returns `mu` equal to $\mu_t$ and the fitted values equal to `mu`. $s$ is returned in the `scale` variable. The parameter $\alpha$ is returned in the variable `other` of the final model. The prediction intervals are produced using the `qalaplace()` function. In order to find the values of $s$ for the holdout the following connection between the variance of the variable and the scale in Asymmetric Laplace distribution is used:
\begin{equation} \label{eq:bALaplaceAndSigma}
s = \sqrt{\sigma^2 \frac{\alpha^2 (1-\alpha)^2}{(1-\alpha)^2 + \alpha^2}},
\end{equation}
where $\sigma^2$ is substituted either by the conditional variance of $\mu_t$ or $y_t$.

**NOTE**: in order for the Asymmetric Laplace to work well, you might need to have large samples. This is inherited from the pinball score of the quantile regression. If you fit the model on 40 observations with $\alpha=0.05$, you will only have 2 observations below the line, which does not help very much with the fit. Similarly, the covariance matrix produced via the Hessian might not be adequate in this situation (because there is not enough variability in the data due to the extreme value of $\alpha$). The latter can be partially addressed by using bootstrap, but do not expect miracles on small samples.

### S distribution {#svetunkov}

The S distribution has the following density function:
\begin{equation} \label{eq:S}
f(y_t) = \frac{1}{4 s^2} \exp \left( -\frac{\sqrt{|y_t - \mu_t|}}{s} \right) ,
\end{equation}
where $s$ is the scale parameter. If estimated via maximum likelihood, the scale parameter is equal to:
\begin{equation} \label{eq:bS}
\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{\left| y_t - \mu_t \right|} ,
\end{equation}
which corresponds to the minimisation of a half of "Mean Root Absolute Error" or "Half Absolute Moment".

S distribution has a kurtosis of 25.2, which makes it a "severe excess" distribution (thus the name). It might be useful in cases of randomly occurring incidents and extreme values (Black Swans?).
Here is how the PDF looks:

```{r pdfS, echo=FALSE}
plot(seq(-5,5,0.01),ds(seq(-5,5,0.01),0,1),type="l",
     xlab="y_t",ylab="Density",main="PDF of S distribution")
lines(seq(-5,5,0.01),ds(seq(-5,5,0.01),0,2), col="blue")
lines(seq(-5,5,0.01),ds(seq(-5,5,0.01),1,1), col="red")
legend("topright",legend=c("S(0,1)","S(0,2)","S(1,1)"),
       lwd=1, col=c("black","blue","red"))
```

The `alm()` function with `distribution="ds"` returns $\mu_t$ in the same variables `mu` and `fitted.values`, and $s$ in the `scale` variable. Similarly to the previous functions, the prediction intervals are based on the `qs()` function from the `greybox` package and use the connection between the scale and the variance:
\begin{equation} \label{eq:bSAndSigma}
s = \left( \frac{\sigma^2}{120} \right) ^{\frac{1}{4}},
\end{equation}
where once again $\sigma^2$ is substituted either by the conditional variance of $\mu_t$ or $y_t$.

### Generalised Normal distribution {#gnormal}

The Generalised Normal distribution is a generalisation, which has Normal, Laplace and S as special cases. It has the following density function:
\begin{equation} \label{eq:gnormal}
f(y_t) = \frac{\beta}{2s \Gamma(1/\beta)}\exp\left(-\left(\frac{|y_t - \mu|}{s}\right)^\beta\right),
\end{equation}
where $s$ is the scale parameter and $\beta$ is the shape parameter. If estimated via maximum likelihood, the scale parameter is equal to:
\begin{equation} \label{eq:gnormalScale}
\hat{s} = \sqrt[^\beta]{\frac{\beta}{T} \sum_{t=1}^T \left| y_t - \mu_t \right|^{\beta}} .
\end{equation}
In the special cases, this becomes either $\sqrt{2}\times$RMSE ($\beta=2$), or MAE ($\beta=1$) or a half of HAM ($\beta=0.5$). It is important to note that although in case of $\beta=2$ the distribution becomes equivalent to the Normal one, its scale will differ from the $\sigma$ (this follows directly from the formula above). The relation between the two is: $s^2 = 2 \sigma^2$.

```{r pdfgnorm, echo=FALSE}
plot(seq(-5,5,0.1),dgnorm(seq(-5,5,0.1),0,1,2),type="l",
     xlab="y_t",ylab="Density",main="PDF of Generalised Normal distribution")
lines(seq(-5,5,0.1),dgnorm(seq(-5,5,0.1),0,1,1), col="blue")
lines(seq(-5,5,0.01),dgnorm(seq(-5,5,0.01),0,1,0.5), col="red")
lines(seq(-5,5,0.1),dgnorm(seq(-5,5,0.1),0,1,100), col="purple")
legend("topright",legend=c("GN(0,1,2)","GN(0,1,1)","GN(0,1,0.5)","GN(0,1,100)"),
       lwd=1, col=c("black","blue","red","purple"))
```

The kurtosis of Generalised Normal distribution is determined by $\beta$ and is equal to $\frac{\Gamma(5/\beta)\Gamma(1/\beta)}{\Gamma(3/\beta)^2}$.

The `alm()` function with `distribution="dgnorm"` returns $\mu_t$ in the same variables `mu` and `fitted.values`, $s$ in the `scale` variable and $\beta$ in `other$beta`. Note that if `beta` is not provided in the function, then it will estimate it. However, the estimates of $\beta$ are known not to be consistent and asymptotically normal if it is less than 2. **So, use with care!**

As for the intervals, they are based on the `qgnorm()` function from the `greybox` package and use the connection between the scale and the variance:
\begin{equation} \label{eq:gnormalAlphaAndSigma}
s = \left( \frac{\sigma^2 \Gamma(1/\beta)}{\Gamma(3/\beta)} \right) ^{\frac{1}{2}},
\end{equation}
where once again $\sigma^2$ is substituted either by the conditional variance of $\mu_t$ or $y_t$, depending on what type of interval is needed.
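For illustration, here is a hedged sketch (not evaluated) of fitting this distribution with a fixed shape parameter. The data (`gnData`) and the model (`gnModel`) are simulated purely for this example, and it assumes that the shape can be passed via `beta` in ellipsis, as described above, and that `rgnorm()` from `greybox` accepts the parameters positionally:

```{r gnormSketch, eval=FALSE}
# Simulate data with Generalised Normal errors (shape of 1.5)
set.seed(5)
gnData <- data.frame(x1=rnorm(150,10,3))
gnData$y <- 20 + 2*gnData$x1 + rgnorm(150, 0, 2, 1.5)
# Fix the shape parameter instead of estimating it
gnModel <- alm(y~x1, gnData, distribution="dgnorm", beta=1.5)
summary(gnModel)
gnModel$other$beta
```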
### Logistic distribution {#logis}

The density function of Logistic distribution is:
\begin{equation} \label{eq:Logistic}
f(y_t) = \frac{\exp \left(- \frac{y_t - \mu_t}{s} \right)} {s \left( 1 + \exp \left(- \frac{y_t - \mu_t}{s} \right) \right)^{2}},
\end{equation}
where $s$ is the scale parameter, which is estimated in `alm()` based on the connection between the parameter and the variance in the logistic distribution:
\begin{equation} \label{eq:sLogisticAndSigma}
\hat{s} = \sigma \frac{\sqrt{3}}{\pi}.
\end{equation}
Once again the maximisation of \eqref{eq:Logistic} implies the estimation of the linear model \eqref{eq:linearModel}, where $\epsilon_t \sim \mathcal{Logistic}(0, s)$.

```{r pdfLogis, echo=FALSE}
plot(seq(-5,5,0.1),dlogis(seq(-5,5,0.1),0,0.5),type="l",
     xlab="y_t",ylab="Density",main="PDF of Logistic distribution")
lines(seq(-5,5,0.1),dlogis(seq(-5,5,0.1),0,2), col="blue")
lines(seq(-5,5,0.1),dlogis(seq(-5,5,0.1),1,1), col="red")
legend("topright",legend=c("Logis(0,0.5)","Logis(0,2)","Logis(1,1)"),
       lwd=1, col=c("black","blue","red"))
```

Logistic is considered a fat-tailed distribution, but its tails are not as fat as those of Laplace. The kurtosis of the standard Logistic is 4.2.

The `alm()` function with `distribution="dlogis"` returns $\mu_t$ in `mu` and in `fitted.values` variables, and $s$ in the `scale` variable. Similarly to the Laplace distribution, the prediction intervals use the connection between the variance and the scale, and rely on the `qlogis()` function.

### Student t distribution {#student}

The Student t distribution has a more complicated density function:
\begin{equation} \label{eq:T}
f(y_t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi} \Gamma\left(\frac{\nu}{2}\right)} \left( 1 + \frac{(y_t - \mu_t)^2}{\nu} \right)^{-\frac{\nu+1}{2}} ,
\end{equation}
where $\nu$ is the number of degrees of freedom, which can also be considered as the scale parameter of the distribution. It has the following connection with the in-sample variance of the error (but only for the case when $\nu>2$):
\begin{equation} \label{eq:scaleOfT}
\nu = \frac{2}{1-\sigma^{-2}}.
\end{equation}

```{r pdfStudent, echo=FALSE}
plot(seq(-5,5,0.1),dt(seq(-5,5,0.1),100),type="l",
     xlab="y_t",ylab="Density",main="PDF of Student's t distribution")
lines(seq(-5,5,0.1),dt(seq(-5,5,0.1),10), col="blue")
lines(seq(-5,5,0.1),dt(seq(-5,5,0.1),1), col="red")
legend("topright",legend=c("t(100)","t(10)","t(1)"),
       lwd=1, col=c("black","blue","red"))
```

The excess kurtosis of Student t distribution depends on the value of $\nu$, and for the cases of $\nu>4$ is equal to $\frac{6}{\nu-4}$. When $\nu \rightarrow \infty$, the distribution converges to the normal one.

The `alm()` function with `distribution="dt"` estimates the parameters of the model along with the $\nu$ (if it is not provided by the user as a `nu` parameter) and returns $\mu_t$ in the variables `mu` and `fitted.values`, and $\nu$ in the `scale` variable. Both prediction and confidence intervals use the `qt()` function from the `stats` package and rely on the conventional number of degrees of freedom $T-k$. The intervals are constructed similarly to how it is done in the Normal distribution \eqref{eq:intervalsNormal} (based on the `qt()` function).
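Before moving to the example below, here is a brief sketch (not evaluated) of how the number of degrees of freedom can be fixed by the user via `nu`, as mentioned above. The data (`tData`) and the model (`tModel`) are simulated purely for illustration:

```{r studentSketch, eval=FALSE}
# Simulate data with Student t errors and fix nu instead of estimating it
set.seed(13)
tData <- data.frame(x1=rnorm(150,10,3))
tData$y <- 20 + 2*tData$x1 + rt(150, df=5)
tModel <- alm(y~x1, tData, distribution="dt", nu=5)
summary(tModel)
```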
### An example of application

In order to see how this works, we will create the following data:

```{r normalDistributionData}
set.seed(41, kind="L'Ecuyer-CMRG")
xreg <- cbind(rnorm(200,10,3),rnorm(200,50,5))
xreg <- cbind(500+0.5*xreg[,1]-0.75*xreg[,2]+rs(200,0,3),xreg,rnorm(200,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
inSample <- xreg[1:180,]
outSample <- xreg[-c(1:180),]
```

ALM can be run either with a data frame or with a matrix. Here's an example with the Normal distribution and several levels for the construction of prediction intervals:

```{r normalRegression}
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnorm")
summary(ourModel)
plot(predict(ourModel,outSample,interval="p",level=c(0.9,0.95)))
```

And here's an example with Asymmetric Laplace and predefined $\alpha=0.95$:

```{r ALaplaceRegression}
ourModel <- alm(y~x1+x2, data=inSample, distribution="dalaplace",alpha=0.95)
summary(ourModel)
plot(predict(ourModel,outSample))
```

## Density functions for continuous non-negative data {#continuousNonNegative}

There are currently three distributions in this group:

1. [Box-Cox Normal distribution](#bcnormal),
2. [Folded Normal distribution](#fnormal),
3. [Rectified Normal distribution](#rectnormal).

They allow the response variable to be positive or zero. Note however that the PDF of the Box-Cox Normal distribution is equal to zero in case of $y_t=0$, which might cause some issues in the estimation.

### Box-Cox Normal distribution {#bcnormal}

Box-Cox Normal distribution used in the `greybox` package is defined as a distribution that becomes normal after the Box-Cox transformation. This means that if $x=\frac{y^\lambda-1}{\lambda}$ and $x \sim \mathcal{N}(\mu, \sigma^2)$, then $y \sim \text{BC}\mathcal{N}(\mu, \sigma^2)$. The density function of the Box-Cox Normal distribution in this case is:
\begin{equation} \label{eq:BCNormal}
f(y_t) = \frac{y_t^{\lambda-1}} {\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(\frac{y_t^{\lambda}-1}{\lambda} - \mu_t \right)^2}{2 \sigma^2} \right) ,
\end{equation}
where the variance estimated using likelihood is:
\begin{equation} \label{eq:sigmaBCNormal}
\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\frac{y_t^{\lambda}-1}{\lambda} - \mu_t \right)^2 .
\end{equation}
Depending on the value of $\lambda$, we will get different shapes of the density function:

```{r pdfBCNorm, echo=FALSE}
plot(seq(0,5,0.1),dbcnorm(seq(0,5,0.1),0,1,1),type="l",ylim=c(0,1),
     xlab="y_t",ylab="Density",main="PDF of Box-Cox Normal distribution")
lines(seq(0.01,5,0.01),dbcnorm(seq(0.01,5,0.01),0,1,0.5), col="blue")
lines(seq(0,5,0.1),dbcnorm(seq(0,5,0.1),0,1,2), col="red")
lines(seq(0,5,0.01),dbcnorm(seq(0,5,0.01),0,1,0.01), col="purple")
legend("topright",legend=c("BCN(0,1,1)","BCN(0,1,0.5)","BCN(0,1,2)","BCN(0,1,0.01)"),
       lwd=1, col=c("black","blue","red","purple"))
```

When $\lambda=0$ the distribution transforms to the [Log-Normal](#lnormal) one.

Estimating the model with Box-Cox Normal distribution is equivalent to estimating the parameters of a linear model after the Box-Cox transform:
\begin{equation} \label{eq:BCLinearModel}
\frac{y_t^{\lambda}-1}{\lambda} = \mu_t + \epsilon_t,
\end{equation}
where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ or:
\begin{equation} \label{eq:BCLinearModelExp}
y_t = \left((\mu_t + \epsilon_t) \lambda +1 \right)^{\frac{1}{\lambda}}.
\end{equation}

`alm()` with `distribution="dbcnorm"` does not transform the provided data and estimates the density directly using the `dbcnorm()` function from `greybox` with the estimated mean $\mu_t$ and the variance \eqref{eq:sigmaBCNormal}. The $\mu_t$ is returned in the variable `mu`, the $\sigma^2$ is in the variable `scale`, while the `fitted.values` contains the inverse Box-Cox transform of $\mu_t$, which, given the connection between the Normal and Box-Cox Normal distributions, corresponds to the median of the distribution rather than the mean. Finally, the `resid()` method returns $e_t = \frac{y_t^{\lambda}-1}{\lambda} - \mu_t$. The $\lambda$ parameter can be provided by the user via `lambdaBC` in ellipsis.

### Folded Normal distribution {#fnormal}

Folded Normal distribution is obtained when the absolute value of a normally distributed variable is taken: if $x \sim \mathcal{N}(\mu, \sigma^2)$, then $|x| \sim \text{Fold}\mathcal{N}(\mu, \sigma^2)$. The density function is:
\begin{equation} \label{eq:foldedNormal}
f(y_t) = \frac{1}{\sqrt{2 \pi \sigma^2}} \left( \exp \left( -\frac{\left(y_t - \mu_t \right)^2}{2 \sigma^2} \right) + \exp \left( -\frac{\left(y_t + \mu_t \right)^2}{2 \sigma^2} \right) \right),
\end{equation}
which can be graphically represented as:

```{r pdfFnorm, echo=FALSE}
plot(seq(0.01,5,0.01),dfnorm(seq(0.01,5,0.01),0,1),type="l",ylim=c(0,1.5),
     xlab="y_t",ylab="Density",main="PDF of Folded Normal distribution")
lines(seq(0.01,5,0.01),dfnorm(seq(0.01,5,0.01),-1,1), col="blue")
lines(seq(0.01,5,0.01),dfnorm(seq(0.01,5,0.01),-2,1), col="red")
legend("topright",legend=c("FN(0,1)","FN(-1,1)","FN(-2,1)"),
       lwd=1, col=c("black","blue","red"))
```

Conditional mean and variance of Folded Normal are estimated in `alm()` (with `distribution="dfnorm"`) similarly to how this is done for the Normal distribution. They are returned in the variables `mu` and `scale` respectively. In order to produce the fitted value (which is returned in `fitted.values`), the following correction is done:
\begin{equation} \label{eq:foldedNormalFitted}
\hat{y_t} = \sqrt{\frac{2}{\pi}} \sigma \exp \left( -\frac{\mu_t^2}{2 \sigma^2} \right) + \mu_t \left(1 - 2 \Phi \left(-\frac{\mu_t}{\sigma} \right) \right),
\end{equation}
where $\Phi(\cdot)$ is the CDF of the Normal distribution.

The model that is assumed in the case of Folded Normal distribution can be summarised as:
\begin{equation} \label{eq:foldedNormalModel}
y_t = \left| \mu_t + \epsilon_t \right|.
\end{equation}

The conditional variance of the forecasts is calculated based on the elements of `vcov()` (as in all the other functions), the predicted values are corrected in the same way as the fitted values \eqref{eq:foldedNormalFitted}, and the prediction intervals are generated from the `qfnorm()` function of `greybox` package. As for the residuals, the `resid()` method returns $e_t = y_t - \mu_t$.

### Rectified Normal distribution {#rectnormal}

Rectified Normal distribution is obtained when all the negative values of a normally distributed variable are set to zero: if $x \sim \mathcal{N}(\mu, \sigma)$, then $y = \max(0, x) \sim \text{Rect}\mathcal{N}(\mu, \sigma)$. The density function is:
\begin{equation} \label{eq:rectnormal}
f(y_t) = I(y_t = 0) \Phi_x(0, \mu, \sigma) + I(y_t > 0) \phi_x(y_t, \mu, \sigma),
\end{equation}
where $\Phi_x(0, \mu, \sigma)$ is the CDF and $\phi_x(y_t, \mu, \sigma)$ is the PDF of the Normal distribution.
This can be graphically represented as:

```{r pdfRectnormal, echo=FALSE}
plot(seq(0,5,0.01),drectnorm(seq(0,5,0.01),0,1),type="l",ylim=c(0,1.5),
     xlab="y_t",ylab="Density",main="PDF of Rectified Normal distribution", xlim=c(0,2))
points(0,drectnorm(0,0,1),col="black")
lines(seq(0,5,0.001),drectnorm(seq(0,5,0.001),1,1), col="blue")
points(0,drectnorm(0,1,1),col="blue")
lines(seq(0,5,0.01),drectnorm(seq(0,5,0.01),-1,1), col="red")
points(0,drectnorm(0,-1,1),col="red")
legend("topright",legend=c("RectN(0,1)","RectN(1,1)","RectN(-1,1)"),
       lwd=1, col=c("black","blue","red"))
```

This distribution can be useful in modelling intermittent demand, when the demand sizes are not integer.

Conditional location and scale of Rectified Normal are estimated in `alm()` (with `distribution="drectnorm"`) similarly to how this is done for the Normal distribution. They are returned in the variables `mu` and `scale` respectively. In order to produce the fitted value (which is returned in `fitted.values`), the following formula is used:
\begin{equation} \label{eq:rectNormalFitted}
\hat{y_t} = \mu_t (1-\Phi_x(0, \mu, \sigma)) + \sigma \phi_x(0, \mu, \sigma) .
\end{equation}

The model that is assumed in the case of Rectified Normal distribution is:
\begin{equation} \label{eq:rectifiedNormalModel}
y_t = \max(\mu_t + \epsilon_t, 0).
\end{equation}

The conditional variance of the forecasts is calculated based on the elements of `vcov()` (as in all the other functions), the predicted values are corrected in the same way as the fitted values \eqref{eq:rectNormalFitted}, and the prediction intervals are generated from the `qrectnorm()` function of `greybox` package. As for the residuals, the `resid()` method returns $e_t = y_t - \mu_t$.

## Density functions for continuous positive data {#continuousPlus}

This group includes:

1. [Log-Normal distribution](#lnormal),
2. [Inverse Gaussian distribution](#invgauss),
3. [Gamma distribution](#gamma),
4. [Exponential distribution](#exponential),
5. [Log-Laplace distribution](#llaplace),
6. [Log-S distribution](#lsvetunkov),
7. [Log-Generalised Normal distribution](#lgnormal).

### Log-Normal distribution {#lnormal}

Log-Normal distribution appears when a normally distributed variable is exponentiated. This means that if $x \sim \mathcal{N}(\mu, \sigma^2)$, then $\exp x \sim \text{log}\mathcal{N}(\mu, \sigma^2)$. The density function of Log-Normal distribution is:
\begin{equation} \label{eq:LogNormal}
f(y_t) = \frac{1}{y_t \sqrt{2 \pi \sigma^2}} \exp \left( -\frac{\left(\log y_t - \mu_t \right)^2}{2 \sigma^2} \right) ,
\end{equation}
where the variance estimated using likelihood is:
\begin{equation} \label{eq:sigmaLogNormal}
\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\log y_t - \mu_t \right)^2 .
\end{equation}
The PDF has the following shape:

```{r pdflognorm, echo=FALSE}
plot(seq(0,5,0.01),dlnorm(seq(0,5,0.01),0,1),type="l",
     xlab="y_t",ylab="Density",ylim=c(0,1.5),main="PDF of Log-Normal distribution")
lines(seq(0,5,0.01),dlnorm(seq(0,5,0.01),1,1), col="blue")
lines(seq(0,5,0.01),dlnorm(seq(0,5,0.01),0,2), col="red")
legend("topright",legend=c("logN(0,1)","logN(1,1)","logN(0,2)"),
       lwd=1, col=c("black","blue","red"))
```

Estimating the model with Log-Normal distribution is equivalent to estimating the parameters of a log-linear model:
\begin{equation} \label{eq:logLinearModel}
\log y_t = \mu_t + \epsilon_t,
\end{equation}
where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ or:
\begin{equation} \label{eq:logLinearModelExp}
y_t = \exp(\mu_t + \epsilon_t).
\end{equation}

`alm()` with `distribution="dlnorm"` does not transform the provided data and estimates the density directly using the `dlnorm()` function with the estimated mean $\mu_t$ and the variance \eqref{eq:sigmaLogNormal}. If you need a log-log model, then you would need to take logarithms of the external variables. The $\mu_t$ is returned in the variable `mu`, the $\sigma^2$ is in the variable `scale`, while the `fitted.values` contains the exponent of $\mu_t$, which, given the connection between the Normal and Log-Normal distributions, corresponds to the median of the distribution rather than the mean. Finally, the `resid()` method returns $e_t = \log y_t - \mu_t$.

### Inverse Gaussian distribution {#invgauss}

Inverse Gaussian distribution is an interesting distribution, which is defined for positive values only and has some properties similar to the properties of the Normal distribution. It has two parameters: location $\mu_t$ and scale $\phi$ (aka "dispersion"). There are different ways to parameterise this distribution; we use the dispersion-based one. The important thing that distinguishes the implementation in `alm()` from the one in `glm()` or in any other function is that we assume that the model has the following form:
\begin{equation} \label{eq:InverseGaussianModel}
y_t = \mu_t \times \epsilon_t
\end{equation}
and that $\epsilon_t \sim \mathcal{IG}(1, \phi)$. This means that $y_t \sim \mathcal{IG}\left(\mu_t, \frac{\phi}{\mu_t} \right)$, implying that the dispersion of the model changes together with the conditional expectation. The density function for the error term in this case is:
\begin{equation} \label{eq:InverseGaussian}
f(\epsilon_t) = \frac{1}{\sqrt{2 \pi \phi \epsilon_t^3}} \exp \left( -\frac{\left(\epsilon_t - 1 \right)^2}{2 \phi \epsilon_t} \right) ,
\end{equation}
where the dispersion parameter is estimated via maximising the likelihood and is calculated using:
\begin{equation} \label{eq:InverseGaussianDispersion}
\hat{\phi} = \frac{1}{T} \sum_{t=1}^T \frac{\left(\epsilon_t - 1 \right)^2}{\epsilon_t} .
\end{equation}
Note that in our formulation $\mu_t = \exp\left( x_t' B \right)$, so that the mean is always positive. This implies that we deal with a pure multiplicative model. In addition, we assume that $\mu_t$ is just a scale for the distribution, otherwise $y_t$ would not follow the Inverse Gaussian distribution.

The density function has the following shapes depending on the values of parameters:

```{r pdfIG, echo=FALSE}
library(statmod)
plot(seq(0,5,0.01),dinvgauss(seq(0,5,0.01),1,1),type="l",
     xlab="y_t",ylab="Density",main="PDF of Inverse Gaussian distribution")
lines(seq(0.01,5,0.01),dinvgauss(seq(0.01,5,0.01),1,2), col="blue")
lines(seq(0,5,0.01),dinvgauss(seq(0,5,0.01),2,1), col="red")
legend("topright",legend=c("IG(1,1)","IG(1,2)","IG(2,1)"),
       lwd=1, col=c("black","blue","red"))
```

`alm()` with `distribution="dinvgauss"` estimates the density for $y_t$ using the `dinvgauss()` function from the `statmod` package. The $\mu_t$ is returned in the variables `mu` and `fitted.values`, the dispersion $\phi$ is in the variable `scale`. The `resid()` method returns $e_t = \frac{y_t}{\mu_t}$. Finally, the prediction intervals for the regression model are generated using the `qinvgauss()` function from the `statmod` package.

### Gamma distribution {#gamma}

Another popular distribution defined for positive values only is called "Gamma". It is parametrised via the shape $k$ and scale $\sigma^2$ and has closed forms for mean and variance: $\mathrm{E}(x)=k \sigma^2$, $\mathrm{V}(x)=k \sigma^4$.
The important thing that distinguishes the implementation in `alm()` from the one in `glm()` or in any other function is that we assume that the model has the following form (similar to the [Inverse Gaussian](#invgauss) model in alm):
\begin{equation*}
y_t = \mu_t \times \epsilon_t
\end{equation*}
and that $\epsilon_t \sim \Gamma \left(\sigma^{-2}, \sigma^2 \right)$, implying that $\mathrm{E}(\epsilon_t)= k \sigma^2 = 1$ and $\mathrm{V}(\epsilon_t)=\sigma^2$. This means that $y_t \sim \Gamma\left(\sigma^{-2}, \sigma^2 \mu_t \right)$, meaning that the variance of the model changes together with the conditional expectation. The density function for the error term in this case is:
\begin{equation} \label{eq:Gamma}
f(\epsilon_t) = \frac{1}{\Gamma(\sigma^{-2}) (\sigma^{2})^{\sigma^{-2}}} \epsilon_t^{\sigma^{-2}-1}\exp \left(-\frac{\epsilon_t}{\sigma^2}\right),
\end{equation}
where the scale parameter $\sigma^2$ can be estimated via the method of moments based on its relation to the variance:
\begin{equation} \label{eq:GammaDispersion}
\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\epsilon_t - 1 \right)^2.
\end{equation}
Note that in our formulation $\mu_t = \exp\left( x_t' B \right)$, so that the mean is always positive, which implies that we deal with a pure multiplicative model. In addition, we assume that $\mu_t$ is just a scale for the distribution, otherwise $y_t$ would not follow Gamma distribution. All of this makes the model restrictive, but arguably reasonable - otherwise the mean of the distribution might behave uncontrollably.

The density function has the following shapes depending on the values of parameters:

```{r pdfGamma, echo=FALSE}
plot(seq(0,5,0.01),dgamma(seq(0,5,0.01),shape=1,scale=1),type="l",
     xlab="y_t",ylab="Density",main="PDF of Gamma distribution")
lines(seq(0.01,5,0.01),dgamma(seq(0.01,5,0.01),shape=0.5,scale=2), col="blue")
lines(seq(0,5,0.01),dgamma(seq(0,5,0.01),shape=2,scale=0.5), col="red")
legend("topright",legend=c("Gamma(1,1)","Gamma(0.5,2)","Gamma(2,0.5)"),
       lwd=1, col=c("black","blue","red"))
```

`alm()` with `distribution="dgamma"` estimates the density for $y_t$ using the `dgamma()` function from the `stats` package. The $\mu_t$ is returned in the variables `mu` and `fitted.values`, the scale $\sigma^2$ is in the variable `scale`. The `resid()` method returns $e_t = \frac{y_t}{\mu_t}$. Finally, the prediction intervals for the regression model are generated using the `qgamma()` function from the `stats` package.

### Exponential distribution {#exponential}

One peculiar and very specific distribution, which can also be used in modelling, is the Exponential distribution. It only has one parameter, $\lambda$, which regulates both mean and variance:
\begin{equation*}
\begin{aligned}
& x \sim \mathrm{Exp}(\lambda) \\
& \mathrm{E}(x) = \frac{1}{\lambda} \\
& \mathrm{V}(x) = \frac{1}{\lambda^2}
\end{aligned} .
\end{equation*}
It might be useful in cases when one wants to model inter-arrival times. The implementation in `alm()` relies on the model, similar to the [Inverse Gaussian](#invgauss) and [Gamma](#gamma) models:
\begin{equation*}
y_t = \mu_t \times \epsilon_t ,
\end{equation*}
where $\epsilon_t \sim \mathrm{Exp} \left(1 \right)$, implying that $\mathrm{E}(\epsilon_t) = \mathrm{V}(\epsilon_t) = 1$. This is a very restrictive model, which only works in some special cases. If for some reason the variance and mean are not equal to one in the empirical distribution, then the Exponential one would not be appropriate.
But in general the model formulated as above implies that $y_t \sim \mathrm{Exp}\left( \frac{1}{\mu_t} \right)$, meaning that the variance of the model changes together with the conditional expectation. The density function for the error term in this case is:
\begin{equation} \label{eq:Exp}
f(\epsilon_t) = \exp(-\epsilon_t).
\end{equation}
Note that in our formulation $\mu_t = \exp\left( x_t' B \right)$, so that the mean is always positive, which implies that we deal with a pure multiplicative model. In addition, we assume that $\mu_t$ is just a scale for the distribution, otherwise $y_t$ would not follow the Exponential distribution.

The density function has the following shapes depending on the values of the expectation:

```{r pdfExp, echo=FALSE}
plot(seq(0,5,0.01),dexp(seq(0,5,0.01),rate=1),type="l",
     xlab="y_t",ylab="Density",main="PDF of Exponential distribution")
lines(seq(0.01,5,0.01),dexp(seq(0.01,5,0.01),rate=0.5), col="blue")
lines(seq(0,5,0.01),dexp(seq(0,5,0.01),rate=2), col="red")
legend("topright",legend=c("Exp(1), mu=1","Exp(0.5), mu=2","Exp(2), mu=0.5"),
       lwd=1, col=c("black","blue","red"))
```

`alm()` with `distribution="dexp"` estimates the density for $y_t$ using the `dexp()` function from the `stats` package. The $\mu_t$ is returned in the variables `mu` and `fitted.values`, the scale is assumed to be equal to one. The `resid()` method returns $e_t = \frac{y_t}{\mu_t}$. Finally, the prediction intervals for the regression model are generated using the `qexp()` function from the `stats` package. **NOTE** that if the assumption of $\mathrm{E}(\epsilon_t) = \mathrm{V}(\epsilon_t) = 1$ does not hold, the model will produce unreasonable quantiles.

### Log-Laplace distribution {#llaplace}

This is based on the exponent of [Laplace distribution](#laplace), which means that the PDF in this case is:
\begin{equation} \label{eq:lLaplace}
f(y_t) = \frac{1}{2 s y_t} \exp \left( -\frac{\left| \log y_t - \mu_t \right|}{s} \right) .
\end{equation}
The model implemented in the package has similarity with [Log-Normal distribution](#lnormal). The MLE of the scale is:
\begin{equation} \label{eq:bLogLaplace}
\hat{s} = \frac{1}{T} \sum_{t=1}^T \left|\log y_t - \mu_t \right| .
\end{equation}
The density function of Log-Laplace has the following shapes:

```{r pdflogLaplace, echo=FALSE}
plot(seq(0.01,5,0.01),dlaplace(log(seq(0.01,5,0.01)),0,1)/seq(0.01,5,0.01),type="l",ylim=c(0,1.5),
     xlab="y_t",ylab="Density",main="PDF of Log-Laplace distribution")
lines(seq(0.01,5,0.01),dlaplace(log(seq(0.01,5,0.01)),0,2)/seq(0.01,5,0.01), col="blue")
lines(seq(0.01,5,0.01),dlaplace(log(seq(0.01,5,0.01)),1,1)/seq(0.01,5,0.01), col="red")
legend("topright",legend=c("logLaplace(0,1)","logLaplace(0,2)","logLaplace(1,1)"),
       lwd=1, col=c("black","blue","red"))
```

Estimating the model with Log-Laplace distribution is equivalent to estimating the parameters of a log-linear model:
\begin{equation*}
\log y_t = \mu_t + \epsilon_t,
\end{equation*}
where $\epsilon_t \sim \mathcal{Laplace}(0, s)$. This distribution might be useful if the data has a strong skewness (larger than in case of Log-Normal distribution).

`alm()` with `distribution="dllaplace"` uses the `dlaplace()` function with the logarithm of the actual values, the estimated mean $\mu_t$ and the scale \eqref{eq:bLogLaplace}. The $\mu_t$ is returned in the variable `mu`, the $s$ is in the variable `scale`, while the `fitted.values` contains the exponent of $\mu_t$, which corresponds to the median of the distribution rather than the mean.
Finally, the `resid()` method returns $e_t = \log y_t - \mu_t$.

### Log-S distribution {#lsvetunkov}

This is based on the exponent of [S distribution](#svetunkov), giving the PDF:
\begin{equation} \label{eq:ls}
f(y_t) = \frac{1}{4 y_t s^2} \exp \left( -\frac{\sqrt{|\log y_t - \mu_t|}}{s} \right) .
\end{equation}
The model implemented in the package has similarity with [Log-Normal](#lnormal) and [Log-Laplace](#llaplace) distributions. The MLE of the scale is:
\begin{equation} \label{eq:bLogS}
\hat{s} = \frac{1}{2T} \sum_{t=1}^T \sqrt{\left| \log(y_t) - \mu_t \right|} .
\end{equation}
The shape of the density function of Log-S is similar to Log-Laplace but with even more extreme values:

```{r pdflogS, echo=FALSE}
plot(seq(0.01,5,0.01),ds(log(seq(0.01,5,0.01)),0,1)/seq(0.01,5,0.01),type="l",ylim=c(0,1.5),
     xlab="y_t",ylab="Density",main="PDF of Log-S distribution")
lines(seq(0.01,5,0.01),ds(log(seq(0.01,5,0.01)),0,2)/seq(0.01,5,0.01), col="blue")
lines(seq(0.01,5,0.01),ds(log(seq(0.01,5,0.01)),1,1)/seq(0.01,5,0.01), col="red")
legend("topright",legend=c("logS(0,1)","logS(0,2)","logS(1,1)"),
       lwd=1, col=c("black","blue","red"))
```

Estimating the model with Log-S distribution is equivalent to estimating the parameters of a log-linear model:
\begin{equation*}
\log y_t = \mu_t + \epsilon_t,
\end{equation*}
where $\epsilon_t \sim \mathcal{S}(0, s)$. This distribution can be used in cases of severe, rarely occurring values in the right tail.

`alm()` with `distribution="dls"` uses the `ds()` function with the logarithm of the actual values, the estimated mean $\mu_t$ and the scale \eqref{eq:bLogS}. The $\mu_t$ is returned in the variable `mu`, the $s$ is in the variable `scale`, while the `fitted.values` contains the exponent of $\mu_t$, which corresponds to the median of the distribution rather than the mean. Finally, the `resid()` method returns $e_t = \log y_t - \mu_t$.

### Log-Generalised Normal distribution {#lgnormal}

This is based on the exponent of [Generalised Normal distribution](#gnormal), giving the PDF:
\begin{equation} \label{eq:lgnormal}
f(y_t) = \frac{\beta}{2s \Gamma(1/\beta)y_t}\exp\left(-\left(\frac{|\log(y_t) - \mu|}{s}\right)^\beta\right).
\end{equation}
The model implemented in the package has similarity with [Log-Normal](#lnormal), [Log-Laplace](#llaplace) and [Log-S](#lsvetunkov) distributions. The MLE of the scale is:
\begin{equation} \label{eq:LogAlpha}
\hat{s} = \sqrt[^\beta]{\frac{\beta}{T} \sum_{t=1}^T \left| \log(y_t) - \mu_t \right|^{\beta}} .
\end{equation}
The shapes of the distribution depend on the values of the parameters, giving it in some cases a very long right tail:

```{r pdflogGN, echo=FALSE}
plot(seq(0.01,5,0.01),dgnorm(log(seq(0.01,5,0.01)),0,1,2)/seq(0.01,5,0.01),type="l",ylim=c(0,1.5),
     xlab="y_t",ylab="Density",main="PDF of Log-Generalised Normal distribution")
lines(seq(0.01,5,0.01),dgnorm(log(seq(0.01,5,0.01)),0,1,1)/seq(0.01,5,0.01), col="blue")
lines(seq(0.01,5,0.01),dgnorm(log(seq(0.01,5,0.01)),0,1,0.5)/seq(0.01,5,0.01), col="red")
lines(seq(0.01,5,0.01),dgnorm(log(seq(0.01,5,0.01)),0,1,100)/seq(0.01,5,0.01), col="purple")
legend("topright",legend=c("logGN(0,1,2)","logGN(0,1,1)","logGN(0,1,0.5)","logGN(0,1,100)"),
       lwd=1, col=c("black","blue","red","purple"))
```

Estimating the model with Log-Generalised Normal distribution is equivalent to estimating the parameters of a log-linear model:
\begin{equation*}
\log y_t = \mu_t + \epsilon_t,
\end{equation*}
where $\epsilon_t \sim \mathcal{GN}(0, s, \beta)$.
`alm()` with `distribution="dlgnorm"` uses the `dgnorm()` function from the `greybox` package with the logarithm of the actual values, the estimated mean $\mu_t$, the scale \eqref{eq:LogAlpha} and either the provided or the estimated shape parameter $\beta$. The $\mu_t$ is returned in the variable `mu`, the $s$ is in the variable `scale` and $\beta$ is in `other$beta`, while the `fitted.values` contains the exponent of $\mu_t$, which corresponds to the median of the distribution rather than the mean. Finally, the `resid()` method returns $e_t = \log y_t - \mu_t$.

## Continuous distributions on a specific interval {#continuousInterval}

There are currently two distributions in this group:

1. [Logit-normal distribution](#logitnormal),
2. [Beta distribution](#beta).

### Logit-normal distribution {#logitnormal}

A random variable follows Logit-normal distribution if its logistic transform follows normal distribution:
\begin{equation} \label{eq:logitFunction}
z = \mathrm{logit}(y) = \log \left(\frac{y}{1-y} \right),
\end{equation}
where $y\in (0,1)$, $y\sim \mathrm{logit}\mathcal{N}(\mu,\sigma^2)$ and $z\sim \mathcal{N}(\mu,\sigma^2)$. The bounds are not supported, because the variable $z$ becomes infinite at them. The density function of $y$ is:
\begin{equation} \label{eq:logitNormal}
f(y_t) = \frac{1}{\sqrt{2 \pi \sigma^2} y_t (1-y_t)} \exp \left( -\frac{\left(\mathrm{logit}(y_t) - \mu_t \right)^2}{2 \sigma^2} \right) ,
\end{equation}
which has the following shapes:

```{r pdfLogitnorm, echo=FALSE}
plot(seq(0.01,0.99,0.01),dlogitnorm(seq(0.01,0.99,0.01),0,1),type="l",ylim=c(0,5),
     xlab="y_t",ylab="Density",main="PDF of Logit-Normal distribution")
lines(seq(0.01,0.99,0.01),dlogitnorm(seq(0.01,0.99,0.01),-1,1), col="blue")
lines(seq(0.01,0.99,0.01),dlogitnorm(seq(0.01,0.99,0.01),1,1), col="purple")
lines(seq(0.01,0.99,0.01),dlogitnorm(seq(0.01,0.99,0.01),0,3), col="red")
legend("topright",legend=c("logitN(0,1)","logitN(-1,1)","logitN(1,1)","logitN(0,3)"),
       lwd=1, col=c("black","blue","purple","red"))
```

Depending on the values of location and scale, the distribution can be either unimodal or bimodal and can be positively or negatively skewed. Because of its connection with the normal distribution, the logit-normal has formulae for density, cumulative and quantile functions. However, the moment generating function does not have a closed form.

The scale of the distribution can be estimated via the maximisation of likelihood and has some similarities with the scale in [Log-Normal](#lnormal) distribution:
\begin{equation} \label{eq:sigmaLogitNormal}
\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^T \left(\mathrm{logit}(y_t) - \mu_t \right)^2 .
\end{equation}
Estimating the model with Logit-Normal distribution is equivalent to estimating the parameters of a logit-linear model:
\begin{equation} \label{eq:logitLinearModel}
\mathrm{logit}(y_t) = \mu_t + \epsilon_t,
\end{equation}
where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ or:
\begin{equation} \label{eq:logitLinearModelExp}
y_t = \mathrm{logit}^{-1}(\mu_t + \epsilon_t),
\end{equation}
where $\mathrm{logit}^{-1}(z)=\frac{\exp(z)}{1+\exp(z)}$ is the inverse logistic transform.

`alm()` with `distribution="dlogitnorm"` does not transform the provided data and estimates the density directly using the `dlogitnorm()` function from the `greybox` package with the estimated mean $\mu_t$ and the variance \eqref{eq:sigmaLogitNormal}.
The $\mu_t$ is returned in the variable `mu`, the $\sigma^2$ is in the variable `scale`, while the `fitted.values` contains the inverse logistic transform of $\mu_t$, which, given the connection between the Normal and Logit-Normal distributions, corresponds to the median of the distribution rather than the mean. Finally, the `resid()` method returns $e_t = \mathrm{logit}(y_t) - \mu_t$.

### Beta distribution {#beta}

Beta distribution is a distribution for a continuous variable that is defined on the interval of $(0, 1)$. Note that the bounds are not included here, because the probability density function is not well defined on them. If the provided data contains either zeroes or ones, the function will modify the values using:
\begin{equation} \label{eq:BetaWarning}
y^\prime_t = y_t (1 - 2 \cdot 10^{-10}),
\end{equation}
and it will warn the user about this modification. This correction makes sure that there are no boundary values in the data, and it is quite artificial and needed for estimation purposes only.

The density function of Beta distribution has the form:
\begin{equation} \label{eq:Beta}
f(y_t) = \frac{y_t^{\alpha_t-1}(1-y_t)^{\beta_t-1}}{B(\alpha_t, \beta_t)} ,
\end{equation}
where $\alpha_t$ is the first shape parameter and $\beta_t$ is the second one. Note the indices for both shape parameters. This is what makes the `alm()` implementation of Beta distribution different from any other. We assume that both of them have underlying deterministic models, so that:
\begin{equation} \label{eq:BetaAt}
\alpha_t = \exp(x_t' A) ,
\end{equation}
and
\begin{equation} \label{eq:BetaBt}
\beta_t = \exp(x_t' B),
\end{equation}
where $A$ and $B$ are the vectors of parameters for the respective shape variables. This allows the function to model any shapes depending on the values of exogenous variables. The conditional expectation of the model is calculated using:
\begin{equation} \label{eq:BetaExpectation}
\hat{y}_t = \frac{\alpha_t}{\alpha_t + \beta_t} ,
\end{equation}
while the conditional variance is:
\begin{equation} \label{eq:BetaVariance}
\text{V}({y}_t) = \frac{\alpha_t \beta_t}{(\alpha_t + \beta_t)^2 (\alpha_t + \beta_t + 1)} .
\end{equation}
Beta distribution has shapes similar to the ones of Logit-Normal one, but with the shape parameters regulating respectively the left and right tails of the distribution:

```{r pdfBeta, echo=FALSE}
plot(seq(0.01,0.99,0.01),dbeta(seq(0.01,0.99,0.01),1,1),type="l",ylim=c(0,4),
     xlab="y_t",ylab="Density",main="PDF of Beta distribution")
lines(seq(0.01,0.99,0.01),dbeta(seq(0.01,0.99,0.01),0.1,1), col="blue")
lines(seq(0.01,0.99,0.01),dbeta(seq(0.01,0.99,0.01),1,0.1), col="purple")
lines(seq(0.01,0.99,0.01),dbeta(seq(0.01,0.99,0.01),2,2), col="red")
legend("topright",legend=c("Beta(1,1)","Beta(0.1,1)","Beta(1,0.1)","Beta(2,2)"),
       lwd=1, col=c("black","blue","purple","red"))
```

The `alm()` function with `distribution="dbeta"` returns $\hat{y}_t$ in the variables `mu` and `fitted.values`, and $\text{V}({y}_t)$ in the `scale` variable. The shape parameters are returned in the respective variables `other$shape1` and `other$shape2`. You will notice that the output of the model contains twice as many parameters as the number of variables in the model. This is because of the estimation of two models: $\alpha_t$ \eqref{eq:BetaAt} and $\beta_t$ \eqref{eq:BetaBt} - instead of one.

Respectively, when the `predict()` function is used for the `alm` model with Beta distribution, the two models are used in order to produce predicted values for $\alpha_t$ and $\beta_t$.
After that the conditional mean `mu` and the conditional variance `variances` are produced using the formulae above. The prediction intervals are generated using the `qbeta()` function with the provided shape parameters for the holdout. As for the confidence intervals, they are produced assuming normality for the parameters of the model and using the estimate of the variance of the mean based on the `variances` (which is weird and probably wrong).

## Probability mass functions of discrete distributions {#discrete}

This group includes:

1. [Poisson distribution](#poisson),
2. [Negative Binomial distribution](#nbinom),
3. [Binomial distribution](#binom),
4. [Geometric distribution](#geom).

These distributions should be used in cases of count data.

### Poisson distribution {#poisson}

Poisson distribution used in ALM has the following standard probability mass function (PMF):
\begin{equation} \label{eq:Poisson}
P(X=y_t) = \frac{\lambda_t^{y_t} \exp(-\lambda_t)}{y_t!},
\end{equation}
where $\lambda_t = \mu_t = \sigma^2_t = \exp(x_t' B)$. As can be noticed, here we assume that the variance of the model varies in time and depends on the values of the exogenous variables, which is a specific case of heteroscedasticity. The exponent of $x_t' B$ is needed in order to avoid negative values in $\lambda_t$.

Here are several examples of the PMF of Poisson with different values of the parameter $\lambda$:

```{r pmfPoisson, echo=FALSE}
plot(seq(0,10,1),dpois(seq(0,10,1),0.1),type="b",ylim=c(0,1),
     xlab="y_t",ylab="Density",main="PMF of Poisson distribution")
par(new=TRUE)
plot(seq(0,10,1),dpois(seq(0,10,1),0.5),type="b",
     col="blue", ylim=c(0,1), axes=FALSE, xlab="", ylab="")
par(new=TRUE)
plot(seq(0,10,1),dpois(seq(0,10,1),1),type="b",
     col="purple", ylim=c(0,1), axes=FALSE, xlab="", ylab="")
par(new=TRUE)
plot(seq(0,10,1),dpois(seq(0,10,1),5),type="b",
     col="red", ylim=c(0,1), axes=FALSE, xlab="", ylab="")
legend("topright",legend=c("Poisson(0.1)","Poisson(0.5)","Poisson(1)","Poisson(5)"),
       lwd=1, col=c("black","blue","purple","red"))
```

`alm()` with `distribution="dpois"` returns `mu`, `fitted.values` and `scale` equal to $\lambda_t$. The quantiles of the distribution in the `predict()` method are generated using the `qpois()` function from the `stats` package. Finally, the returned residuals correspond to $y_t - \mu_t$, which is not really helpful or meaningful.

### Negative Binomial distribution {#nbinom}

Negative Binomial distribution implemented in `alm()` is parameterised in terms of mean and variance:
\begin{equation} \label{eq:NegBin}
P(X=y_t) = \binom{y_t+\frac{\mu_t^2}{\sigma^2-\mu_t}}{y_t} \left( \frac{\sigma^2 - \mu_t}{\sigma^2} \right)^{y_t} \left( \frac{\mu_t}{\sigma^2} \right)^\frac{\mu_t^2}{\sigma^2 - \mu_t},
\end{equation}
where $\mu_t = \exp(x_t' B)$ and $\sigma^2$ is estimated separately in the optimisation process. These values are then used in the `dnbinom()` function in order to calculate the log-likelihood based on the distribution function.
Here are some examples of the PMF of Negative Binomial distribution with different sizes and probabilities:

```{r pmfNegBin, echo=FALSE}
plot(seq(0,10,1),dnbinom(seq(0,10,1),3,0.5),type="b",ylim=c(0,0.4),
     xlab="y_t",ylab="Density",main="PMF of Negative Binomial distribution")
par(new=TRUE)
plot(seq(0,10,1),dnbinom(seq(0,10,1),5,0.5),type="b",
     col="blue", ylim=c(0,0.4), axes=FALSE, xlab="", ylab="")
par(new=TRUE)
plot(seq(0,10,1),dnbinom(seq(0,10,1),5,0.75),type="b",
     col="purple", ylim=c(0,0.4), axes=FALSE, xlab="", ylab="")
par(new=TRUE)
plot(seq(0,10,1),dnbinom(seq(0,10,1),5,0.25),type="b",
     col="red", ylim=c(0,0.4), axes=FALSE, xlab="", ylab="")
legend("topright",legend=c("NegBin(3,0.5)","NegBin(5,0.5)","NegBin(5,0.75)","NegBin(5,0.25)"),
       lwd=1, col=c("black","blue","purple","red"))
```

`alm()` with `distribution="dnbinom"` returns $\mu_t$ in `mu` and `fitted.values` and $\sigma^2$ in `scale`. The prediction intervals are produced using the `qnbinom()` function. Similarly to the Poisson distribution, the `resid()` method returns $y_t - \mu_t$. The user can also provide the `size` parameter via the ellipsis if it is reasonable to assume that it is known.

### Binomial distribution {#binom}

The PMF of Binomial distribution is written as:
\begin{equation} \label{eq:Bin}
    P(X=y_t) = \binom{n}{y_t} p_t^{y_t} (1-p_t)^{n-y_t},
\end{equation}
where $n$ is the size parameter, $\mu_t=\exp(x_t' B)$, $p_t = \frac{1}{1+\mu_t}$ and $\mathrm{E}({y}_t) = n \times p_t$. The size parameter is always known and can be calculated as the number of unique values of $y$ minus one. So, if the data takes one of the three values: 0, 1 and 2, the size will be $n=2$. The values of $p_t$ and $n$ are then used in the `dbinom()` function to calculate the log-likelihood.

Visually, the PMF of the Binomial distribution has the following shapes with different values of probability and size:

```{r pmfBin, echo=FALSE}
plot(seq(0,10,1),dbinom(seq(0,10,1),3,0.5),type="b",ylim=c(0,0.4),
     xlab="y_t",ylab="Density",main="PMF of Binomial distribution")
par(new=TRUE)
plot(seq(0,10,1),dbinom(seq(0,10,1),5,0.5),type="b",
     col="blue", ylim=c(0,0.4), axes=FALSE, xlab="", ylab="")
par(new=TRUE)
plot(seq(0,10,1),dbinom(seq(0,10,1),5,0.75),type="b",
     col="purple", ylim=c(0,0.4), axes=FALSE, xlab="", ylab="")
par(new=TRUE)
plot(seq(0,10,1),dbinom(seq(0,10,1),5,0.25),type="b",
     col="red", ylim=c(0,0.4), axes=FALSE, xlab="", ylab="")
legend("topright",legend=c("Bin(3,0.5)","Bin(5,0.5)","Bin(5,0.75)","Bin(5,0.25)"),
       lwd=1, col=c("black","blue","purple","red"))
```

`alm()` with `distribution="dbinom"` returns $\mu_t$ in `mu` and $\mathrm{E}({y}_t)$ in `fitted.values`, while the `scale` and `other$size` both contain the same value of $n$. The prediction intervals are produced using the `qbinom()` function. The `resid()` method returns $y_t - \mathrm{E}({y}_t)$.

### Geometric distribution {#geom}

There is also Geometric distribution implemented in `alm()`. It has the following PMF:
\begin{equation} \label{eq:Geom}
    P(X=y_t) = (1-p_t)^{y_t} p_t,
\end{equation}
where $p_t = \frac{1}{1+\exp(x_t' B)}$. The conditional expectation in this model is calculated as $\mu_t = \frac{1}{p_t} - 1 = \exp(x_t' B)$. The probability is used in the calculation of the log-likelihood via the `dgeom()` function.
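To make the link between the linear predictor and the probability explicit, here is a small sketch (the matrix `X` and the vector `B` below are hypothetical; the relations themselves follow from the formulae above):

```{r geomLink, eval=FALSE}
# Hypothetical linear predictor for three observations
X <- cbind(1, c(2, 5, 8))
B <- c(0.1, 0.2)
q <- X %*% B
# Probability and conditional mean implied by the Geometric parameterisation
p  <- 1 / (1 + exp(q))     # p_t = 1/(1+exp(x_t' B))
mu <- exp(q)               # mu_t = 1/p_t - 1
# Log-likelihood contributions for hypothetical counts
dgeom(c(0, 2, 7), prob=p, log=TRUE)
```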
Several PMFs of the Geometric distribution are shown in the following figure:

```{r pmfGeom, echo=FALSE}
plot(seq(0,10,1),dgeom(seq(0,10,1),0.25),type="b",ylim=c(0,1),
     xlab="y_t",ylab="Density",main="PMF of Geometric distribution")
par(new=TRUE)
plot(seq(0,10,1),dgeom(seq(0,10,1),0.5),type="b",
     col="blue", ylim=c(0,1), axes=FALSE, xlab="", ylab="")
par(new=TRUE)
plot(seq(0,10,1),dgeom(seq(0,10,1),0.75),type="b",
     col="purple", ylim=c(0,1), axes=FALSE, xlab="", ylab="")
# par(new=TRUE)
# plot(seq(0,10,1),dgeom(seq(0,10,1),0.9),type="b",
#      col="red", ylim=c(0,1), axes=FALSE, xlab="", ylab="")
legend("topright",legend=c("Geom(0.25)","Geom(0.5)","Geom(0.75)"),
       lwd=1, col=c("black","blue","purple"))
```

`alm()` with `distribution="dgeom"` returns $\mu_t$ in `mu` and in `fitted.values`. The `scale` does not contain anything useful, because the Geometric distribution has a time varying variance, which depends on the probability. The prediction intervals are produced using the `qgeom()` function. Similarly to the other count distributions, the `resid()` method returns $y_t - \mu_t$.

### An example of application

For the next example, we round the absolute values of the response variable to obtain count data:

```{r dataRound}
xreg[,1] <- round(abs(xreg[,1]))
inSample <- xreg[1:180,]
outSample <- xreg[-c(1:180),]
```

Negative Binomial distribution:

```{r negBinRegression}
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnbinom")
summary(ourModel)
```

And an example with a predefined size:

```{r negBinRegressionWithSize}
ourModel <- alm(y~x1+x2, data=inSample, distribution="dnbinom", size=30)
summary(ourModel)
```

## Cumulative functions for binary variables {#cumulative}

The final class of models includes two cases:

1. [Logistic distribution (logit model)](#logit),
2. [Normal distribution (probit model)](#probit).

In both of them it is assumed that the response variable is binary and can be either zero or one. The main idea for this class of models is to use a transformation of the original data and link a continuous latent variable with the binary one. As a reminder, all the models eventually assume that:
\begin{equation} \label{eq:basicALMCumulative}
    \begin{matrix}
        o_t \sim \mathrm{Bernoulli}(p_t) \\
        p_t = g(x_t' A)
    \end{matrix},
\end{equation}
where $o_t$ is the binary response variable and $g(\cdot)$ is the cumulative distribution function. Given that we work with the probability of occurrence, the `predict()` method produces forecasts for the probability of occurrence rather than the binary variable itself. Finally, although many other cumulative distribution functions can be used for this transformation (e.g. `plaplace()` or `plnorm()`), the most popular ones are the logistic and normal CDFs.

Given that the binary variable has Bernoulli distribution, its log-likelihood is:
\begin{equation} \label{eq:BernoulliLikelihood}
    \ell(p_t | o_t) = \sum_{o_t=1} \log p_t + \sum_{o_t=0} \log(1 - p_t) ,
\end{equation}
so the estimation of parameters for all the CDFs can be done by maximising this likelihood.
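As a tiny illustration of this likelihood, the following sketch computes \eqref{eq:BernoulliLikelihood} for a handful of made-up observations and probabilities (both vectors are hypothetical):

```{r bernoulliLikelihood, eval=FALSE}
# Hypothetical binary outcomes and estimated probabilities of occurrence
ot <- c(1, 0, 1, 1, 0)
pt <- c(0.9, 0.2, 0.6, 0.8, 0.4)
# The Bernoulli log-likelihood: sum of log(p) for o=1 and log(1-p) for o=0
sum(log(pt[ot==1])) + sum(log(1-pt[ot==0]))
# The same value via dbinom()
sum(dbinom(ot, size=1, prob=pt, log=TRUE))
```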
In all the functions it is assumed that the probability $p_t$ corresponds to some sort of unobservable "level" $q_t = x_t' A$, and that there is no randomness in this level. So the aim of all the functions is to estimate this level correctly and then get an estimate of the probability based on it. The error of the model is calculated using the observed occurrence variable and the estimated probability $\hat{p}_t$. In a way, in this calculation we assume that $o_t=1$ happens mainly when the respective estimated probability $\hat{p}_t$ is very close to one. So, the error can be calculated as:
\begin{equation} \label{eq:BinaryError}
    u_t' = o_t - \hat{p}_t .
\end{equation}
However, this error is not useful on its own and should be somehow transformed into the original scale of $q_t$. Given that $o_t \in \{0, 1\}$ and $\hat{p}_t \in (0, 1)$, the error will lie in $(-1, 1)$. We therefore standardise it so that it lies in the region of $(0, 1)$:
\begin{equation} \label{eq:BinaryErrorBounded}
    u_t = \frac{u_t' + 1}{2} = \frac{o_t - \hat{p}_t + 1}{2}.
\end{equation}
This transformation means that, when $o_t=\hat{p}_t$, the error is $u_t=0.5$; when $o_t=1$ and $\hat{p}_t=0$, then $u_t=1$; and, in the opposite case of $o_t=0$ and $\hat{p}_t=1$, $u_t=0$. After that this error is transformed into the scale of $q_t$ using the quantile function of either the Logistic or the Normal distribution, making sure that the case of $u_t=0.5$ corresponds to zero, $u_t>0.5$ corresponds to positive errors and $u_t<0.5$ corresponds to negative ones. The distribution of the error term is unknown, but it is in general bimodal.

### Cumulative Logistic distribution {#logit}

We have previously discussed the density function of the Logistic distribution. The standardised cumulative distribution function used in `alm()` is:
\begin{equation} \label{eq:LogisticCDFALM}
    \hat{p}_t = \frac{1}{1+\exp(-\hat{q}_t)},
\end{equation}
where $\hat{q}_t = x_t' A$ is the conditional mean of the level, underlying the probability. This value is then used in the likelihood \eqref{eq:BernoulliLikelihood} in order to estimate the parameters of the model. The error term of the model is calculated using the formula:
\begin{equation} \label{eq:LogisticError}
    e_t = \log \left( \frac{u_t}{1 - u_t} \right) = \log \left( \frac{1 + o_t (1 + \exp(\hat{q}_t))}{1 + \exp(\hat{q}_t) (2 - o_t) - o_t} \right).
\end{equation}
This way the error varies from $-\infty$ to $\infty$ and is equal to zero, when $u_t=0.5$.

```{r cdfLogis, echo=FALSE}
plot(seq(-5,5,0.01),plogis(seq(-5,5,0.01),0,1),type="l",
     xlab="y_t",ylab="Density",main="CDF of Logistic distribution")
lines(seq(-5,5,0.01),plogis(seq(-5,5,0.01),-1,1), col="blue")
lines(seq(-5,5,0.01),plogis(seq(-5,5,0.01),1,1), col="purple")
lines(seq(-5,5,0.01),plogis(seq(-5,5,0.01),2,2), col="red")
legend("bottomright",legend=c("Logit(0,1)","Logit(-1,1)","Logit(1,1)","Logit(2,2)"),
       lwd=1, col=c("black","blue","purple","red"))
```

The `alm()` function with `distribution="plogis"` returns $q_t$ in `mu`, the standard deviation, calculated using the respective errors \eqref{eq:LogisticError}, in `scale` and the probability $\hat{p}_t$ based on \eqref{eq:LogisticCDFALM} in `fitted.values`. The `resid()` method returns the errors discussed above. The `predict()` method produces point forecasts and the intervals for the probability of occurrence. The intervals use the assumption of normality of the error term, generating respective quantiles (based on the estimated $q_t$ and the variance of the error) and then transforming them into the scale of probability using the Logistic CDF. *This method for intervals calculation is approximate and should not be considered as a final solution!*
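To see how the error \eqref{eq:LogisticError} behaves, here is a small numerical sketch (the values of $o_t$ and $\hat{q}_t$ below are made up):

```{r logitErrorSketch, eval=FALSE}
# Hypothetical levels and binary outcomes
qt <- c(-2, 0, 2)
ot <- c(0, 1, 1)
# Estimated probabilities and the standardised error u_t
pt <- 1 / (1 + exp(-qt))
ut <- (ot - pt + 1) / 2
# The error in the scale of q_t: zero when u_t = 0.5, i.e. when o_t = p_t
et <- log(ut / (1 - ut))
cbind(pt, ut, et)
```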
### Cumulative Normal distribution {#probit}

The case of cumulative Normal distribution is quite similar to the cumulative Logistic one. The transformation is done using the standard normal CDF:
\begin{equation} \label{eq:NormalCDFALM}
    \hat{p}_t = \Phi(q_t) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{q_t} \exp \left(-\frac{1}{2}x^2 \right) dx ,
\end{equation}
where $q_t = x_t' A$. Similarly to the Logistic CDF, the estimated probability is used in the likelihood \eqref{eq:BernoulliLikelihood} in order to estimate the parameters of the model. The error term is calculated using the standardised quantile function of the Normal distribution:
\begin{equation} \label{eq:NormalError}
    e_t = \Phi^{-1} \left(\frac{o_t - \hat{p}_t + 1}{2}\right) .
\end{equation}
It acts similarly to the error from the [Logistic distribution](#logit), but is based on a different set of functions. Its CDF has shapes similar to the logit one:

```{r cdfNorm, echo=FALSE}
plot(seq(-5,5,0.01),pnorm(seq(-5,5,0.01),0,1),type="l",
     xlab="y_t",ylab="Density",main="CDF of Normal distribution")
lines(seq(-5,5,0.01),pnorm(seq(-5,5,0.01),-1,1), col="blue")
lines(seq(-5,5,0.01),pnorm(seq(-5,5,0.01),1,1), col="purple")
lines(seq(-5,5,0.01),pnorm(seq(-5,5,0.01),2,2), col="red")
legend("bottomright",legend=c("N(0,1)","N(-1,1)","N(1,1)","N(2,2)"),
       lwd=1, col=c("black","blue","purple","red"))
```

Similarly to the Logistic CDF, the `alm()` function with `distribution="pnorm"` returns $q_t$ in `mu`, the standard deviation, calculated based on the errors \eqref{eq:NormalError}, in `scale` and the probability $\hat{p}_t$ based on \eqref{eq:NormalCDFALM} in `fitted.values`. The `resid()` method returns the errors discussed above. The `predict()` method produces point forecasts and the intervals for the probability of occurrence. *The intervals are also approximate and use the same principle as in the Logistic CDF.*

## Mixture distribution models {#mixture}

Finally, mixture distribution models can be used in `alm()` by defining the `distribution` and `occurrence` parameters. Currently only `plogis()` and `pnorm()` are supported for the occurrence variable, but all the other distributions discussed above can be used for the modelling of the non-zero values. If `occurrence="plogis"` or `occurrence="pnorm"`, then `alm()` is fit two times: first on the non-zero data only (defining the subset) and second - using the same data, but with the response variable substituted by the binary occurrence variable and with `distribution=occurrence`. As an alternative option, the occurrence `alm()` model can be estimated separately and then provided as a variable in `occurrence`.
As an example of a mixture model, let's generate some data:

```{r mixtureExampleData}
set.seed(42, kind="L'Ecuyer-CMRG")
xreg <- cbind(rnorm(200,10,3),rnorm(200,50,5))
xreg <- cbind(500+0.5*xreg[,1]-0.75*xreg[,2]+rs(200,0,3),xreg,rnorm(200,300,10))
colnames(xreg) <- c("y","x1","x2","Noise")
xreg[,1] <- round(exp(xreg[,1]-400) / (1 + exp(xreg[,1]-400)),0) * xreg[,1]
# Sometimes the generated data contains huge values
xreg[is.nan(xreg[,1]),1] <- 0;
inSample <- xreg[1:180,]
outSample <- xreg[-c(1:180),]
```

First, we estimate the occurrence model (it will complain that the response variable is not binary, but it will work):

```{r mixtureExampleOccurrence}
modelOccurrence <- alm(y~x1+x2+Noise, inSample, distribution="plogis")
```

And then use it for the mixture model:

```{r mixtureExampleFinal}
modelMixture <- alm(y~x1+x2+Noise, inSample, distribution="dlnorm", occurrence=modelOccurrence)
```

The occurrence model will be returned in the respective variable:

```{r mixtureSummary}
summary(modelMixture)
summary(modelMixture$occurrence)
```

We can also do regression diagnostics using plots:

```{r mixtureDiagnostics}
par(mfcol=c(3,3))
plot(modelMixture, c(1:9))
```

After that we can produce forecasts using the data from the holdout sample (in this example we also ask for several levels for the intervals):

```{r mixturePredict, eval=FALSE}
predict(modelMixture,outSample,interval="p",level=c(0.8,0.9,0.95))
```

If you expect autoregressive elements in the data, then you can specify the order of ARIMA via the respective parameter (note that the MA part is not supported yet; use the `adam()` function from the `smooth` package for that):

```{r mixtureExampleFinalAR}
modelMixtureAR <- alm(y~x1+x2+Noise, inSample, distribution="dlnorm", occurrence=modelOccurrence, orders=c(1,0,0))
summary(modelMixtureAR)
plot(predict(modelMixtureAR, outSample, interval="p", side="u"))
```

If the explanatory variables are not available for the holdout sample, the `forecast()` function can be used:

```{r mixtureExampleFinalARForecast}
plot(forecast(modelMixtureAR, h=10, interval="p", side="u"))
```

It will produce forecasts for each of the explanatory variables based on the available data using the `es()` function from the `smooth` package (if it is available; otherwise, it will use Naive) and use those values as the new data. Alternatively, a user might know when demand occurs in future and can provide the vector of zeroes and ones so that the function takes them into account properly:

```{r eval=FALSE}
forecast(modelMixtureAR, h=10, interval="p", side="u", occurrence=c(0,1,0,1,1,1,0,0,0,1))
```

Similarly, a vector of zeroes and ones can be provided in the `occurrence` variable in `alm()` to let the function know that the occurrence is not stochastic (e.g. zeroes in the data appear because we do not sell products over weekends).

# Scale model {#scaleModel}

Finally, `alm()` supports the scale model, estimating the scale of the assumed distribution via a specified set of variables (this is inspired by the GAMLSS model). This might be handy if you suspect that some variables cause heteroscedasticity. However, this can only be done if the likelihood approach is used for the model estimation and is done via the `scale` parameter, which can accept either a previously estimated model (e.g. via the `sm()` method) or a formula for the model.
While the details will differ from one distribution to another, the main idea will be the same, so here we discuss it using the example of the [Normal distribution](#normal), for which the model becomes:
\begin{equation} \label{eq:scaleModel}
    y_t \sim \mathcal{N}\left(\mathbf{x}_t' \mathbf{b}, \exp\left(\mathbf{z}_t' \mathbf{c}\right) \right),
\end{equation}
where $\mathbf{x}_t$ and $\mathbf{b}$ are defined [as above](#normal), $\mathbf{z}_t$ is the vector of explanatory variables for the scale and $\mathbf{c}$ is the vector of parameters for this part of the model. The exponent in the formula above is needed to make sure that the resulting values are always positive. The resulting value $\hat{\sigma}_t^2 = \exp\left(\mathbf{z}_t' \mathbf{c}\right)$ will be the variance of the residuals conditional on the values of the explanatory variables. The same model can be rewritten as (in case of normal distribution):
\begin{equation} \label{eq:scaleModel2}
    y_t = \mathbf{x}_t' \mathbf{b} + \sqrt{\exp\left(\mathbf{z}_t' \mathbf{c}\right)} \epsilon_t,
\end{equation}
where $\epsilon_t \sim \mathcal{N}\left(0, 1 \right)$.

Given that the location and scale of the distribution are independent, we can construct the complete model in two steps: first by estimating the parameters of location and then the parameters of scale. The parameters are initialised based on the residuals of the model and then are optimised via the maximisation of the likelihood of the specified probability density function. This is done using the generic method `sm()` after constructing a model. For example, you can estimate a model based on `lm()` and then construct a scale model for it (`sm()` will assume Normal distribution in this case and will use likelihood):

```{r}
locationModel <- lm(mpg~., mtcars)
scaleModel <- sm(locationModel,~qsec+wt)
```

In `sm()`, the estimation starts with least squares applied to the logarithms of squared residuals regressed on the selected explanatory variables. The obtained parameters are then used in the optimiser (via the `nloptr()` function) and are modified to maximise the likelihood function of the Normal distribution. The function returns an object of class `scale`, which contains several potentially useful variables, including the fitted values ($\hat{\sigma}^2_t = \exp\left(\mathbf{z}_t' \mathbf{c}\right)$) and residuals ($\epsilon_t$). The formula for residuals would differ from one distribution to another. The residuals can be used for model diagnostics to see if the potential problems of the location model have been resolved (e.g. heteroscedasticity removed).

The resulting model supports several methods. Here is an example:

```{r}
summary(scaleModel)
```

The diagnostic plots can be produced as well. The error term in this case would be standardised. If the scale model needs to be taken into account in forecasting, then the `implant()` method can be used to implant it into the previously estimated model.
This only works with the `alm()` function:

```{r}
locationModel <- alm(mpg~., mtcars)
scaleModel <- sm(locationModel,~qsec+wt)
locationModel <- implant(locationModel,scaleModel)
```

When it comes to `alm()`, the model can be estimated in the same function:

```{r}
almModel <- alm(mpg~., mtcars[-c(1:3),], scale=~qsec+wt)
```

In this case, the summary of the model would differ from the basic call of `alm()` - it would include the output for the scale model together with the location one:

```{r}
summary(almModel)
```

Forecasting of the overall model (with location and scale) can be done in the same way as for the basic model:

```{r eval=FALSE}
plot(predict(almModel,mtcars[1:3,],interval="p",level=0.95))
```

If the prediction interval is required, the function would first produce forecasts of the scale, and then use them for the construction of the interval via the standard procedure, depending on the assumed distribution.

As mentioned earlier, depending on the distribution, we would have slight differences in the function `sm()`, with different transformations of residuals of the location model (needed for the initialisation) and transformations of the scale depending on the used distribution. Here is the full list of these:

1. [Normal](#normal), [Log-Normal](#lnormal), [Box-Cox Normal](#bcnormal), [Logit Normal](#logitnormal), [Folded Normal](#fnormal) and [Logistic distribution](#logis). Logarithms of squared residuals are used in the initialisation, $\hat{\sigma}_t$ is used in the density function;
2. [Laplace](#laplace), [Log-Laplace](#llaplace) and [Asymmetric Laplace](#alaplace) distributions. Logarithms of absolute residuals are used in the initialisation, $\hat{\sigma}_t$ is used in the density function;
3. [S](#svetunkov) and [Log-S](#lsvetunkov) distributions. Logarithms of square roots of absolute residuals are used in the initialisation, $\hat{\sigma}_t^2$ is used in the density function;
4. [Generalised Normal](#gnormal) and [Log-Generalised Normal](#lgnormal) distributions. Logarithms of absolute residuals in the power of $\beta$ are used in the initialisation, $\hat{\sigma}_t^{\frac{1}{\beta}}$ is used in the density function;
5. [Gamma distribution](#gamma). Logarithms of the squared deviations of residuals from one ($\log\left((e_t-1)^2\right)$) are used in the initialisation. This is needed in order to get to the estimate of the variance of the residuals, as the distribution in `alm()` assumes that $\mathrm{E}(e_t)=1$. $\hat{\sigma}_t$ is used in the density function;
6. Inverse Gaussian distribution. The residuals are transformed according to the formula $\log\left(\frac{(e_t-1)^2}{e_t}\right)$, which (similarly to the Gamma distribution) is motivated by the assumption of $\mathrm{E}(e_t)=1$. $\hat{\sigma}_t$ is used in the density function;
7. All the other distributions rely on logarithms of absolute residuals and use $\hat{\sigma}_t$. This does not necessarily have a good reason behind it and is done for fail-safe purposes.

Note that the scale model is a relatively new feature, so it might not produce perfect results. Let me know if you have ideas of how to improve it via [https://github.com/config-i1/greybox/issues](https://github.com/config-i1/greybox/issues).

# Using different loss functions {#losses}

There are several loss functions implemented in the function and there is an option for a user to provide their own. If the `loss` is not `"likelihood"`, then the distribution is only needed for inference.
Keep in mind that this typically means that the likelihood is not maximised, so the inference might be wrong and the results can be misleading. However, there are several combinations of losses and distributions for which minimising the loss is equivalent to maximising the likelihood:

1. `loss="MSE"`, `distribution=c("dnorm","dlnorm","dbcnorm","dlogitnorm")`;
2. `loss="MAE"`, `distribution=c("dlaplace","dllaplace")`;
3. `loss="HAM"`, `distribution=c("ds","dls")`.

The likelihoods of the distributions above are maximised when the respective losses are minimised, so all the inference discussed above holds in these three situations. In all the other cases, the `alm()` function will not return `logLik` and might complain that the derivations can be misleading. Also, when `loss="likelihood"` (and in the three cases above) the number of estimated parameters includes the scale of the distribution (e.g. the standard deviation in case of the Normal distribution), but it is not counted in the number of parameters for all the other losses.

Finally, we can provide custom loss functions. Here is an example:

```{r}
lossFunction <- function(actual, fitted, B, xreg){
    return(mean(abs(actual-fitted)^3));
}
modelLossCustom <- alm(y~x1+x2+Noise, inSample, distribution="dnorm", loss=lossFunction)
summary(modelLossCustom)
```

Keep in mind that it is important to have the parameters `actual`, `fitted`, `B` and `xreg` in the custom loss function, even if they are not used. `xreg` is the matrix of the expanded explanatory variables (including the intercept in the first column if it is needed).

You might also notice that there are LASSO and RIDGE available as options for the `loss` parameter. These are experimental. When they are used, the explanatory variables are not normalised, but the parameters (all but the intercept) are divided by the standard deviations of the explanatory variables, which has a similar (but not the same) effect. Also, in order for this to work appropriately, *you need to provide* the parameter `lambda`, which should lie between 0 and 1, where `lambda=0` means that there is no shrinkage and `lambda=1` implies that there is only shrinkage, and the MSE is not used at all (see the short sketch at the end of this section). The MSE part of the loss is divided by $\mathrm{V}\left({\Delta}y_t\right)$, where ${\Delta}y_t=y_t-y_{t-1}$, bringing it closer to the scale of parameters and hopefully making $\lambda$ a bit more meaningful. Note that due to the formulation of `alm()`, you can use LASSO for [Logistic](#logit) (`distribution="plogis"`), [Poisson](#poisson) (`distribution="dpois"`) and [Negative Binomial](#nbinom) (`distribution="dnbinom"`) regressions as well. Keep in mind that in order to get better results, you might need to tune [the parameters of the optimiser](#estimator) (e.g. set `maxeval` to a higher value or reduce `ftol_rel`).

When working with non-conventional loss functions, do not rely on the default standard errors of parameters, because they are not what they seem: they are only valid in case of `loss="likelihood"`. A correct method for calculating the standard errors is the bootstrap, which is now available via the `bootstrap=TRUE` parameter. Here is an example:

```{r}
summary(modelLossCustom, bootstrap=TRUE, nsim=100)
```

Here `nsim` is a parameter passed to the `coefbootstrap()` method, which is used in order to generate coefficients of models and then extract the necessary statistics. See `?coefbootstrap` for the list of parameters accepted by the function. There are several methods that can use `coefbootstrap()` and will rely on the bootstrapped covariance matrix: `vcov()`, `coefint()`, `summary()`, `predict()` and `forecast()`.
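As a quick illustration of the shrinkage losses mentioned above, here is a hedged sketch of a LASSO estimation with `alm()` (the value of `lambda` and the model specification are chosen arbitrarily for demonstration):

```{r lassoExample, eval=FALSE}
# A sketch of LASSO estimation in alm(); lambda is chosen arbitrarily here
modelLASSO <- alm(y~x1+x2+Noise, inSample, distribution="dnorm",
                  loss="LASSO", lambda=0.9)
coef(modelLASSO)
```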
# Estimation process {#estimator}

There are several parameters in the optimiser that can be regulated by the user. Here is the list:

1. `B` - the vector of starting values of parameters for the optimiser, which should correspond to the ordering of the explanatory variables. In order to see the order of parameters, you can fit a model to the data and extract `B` from it via `ourModel$B`;
2. `algorithm` - the algorithm to use in optimisation (`"NLOPT_LN_SBPLX"` by default);
3. `maxeval` - the maximum number of iterations to carry out (the default is 200 for simpler models, 500 for [Logistic](#logit), [Probit](#probit), [Poisson](#poisson) and [Negative Binomial](#nbinom) and 1000 in case of `loss=c("LASSO","RIDGE")`);
4. `maxtime` - stop when the optimisation time (in seconds) exceeds this value;
5. `xtol_rel` - the relative precision of the optimiser (the default is 1E-6);
6. `xtol_abs` - the absolute precision of the optimiser (the default is 1E-8);
7. `ftol_rel` - the stopping criterion in case of the relative change in the loss function (the default is 1E-4);
8. `ftol_abs` - the stopping criterion in case of the absolute change in the loss function (the default is 0 - not used);
9. `print_level` - the level of output for the optimiser (0 by default). If equal to 41, then the detailed results of the optimisation are returned.

You can read more about these parameters by running the function `nloptr.print.options()` from the `nloptr` package. Typically, if your model does not work as expected, you might need to tune some of these parameters in order to make sure that the optimum of the loss is reached to a satisfactory extent.

# References