--- title: "Variable Length Markov Chains with Covariates (COVLMC)" output: rmarkdown::html_vignette: df_print: kable vignette: > %\VignetteIndexEntry{Variable Length Markov Chains with Covariates (COVLMC)} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 7, fig.align = "center" ) ``` ```{r setup} library(mixvlmc) library(ggplot2) ``` ## Limitations of Variable Length Markov Chains Variable Length Markov Chains (VLMC) are very useful to capture complex structures in discrete time series as they can mix short memory with long memory in a contextual way, leading to sparse models. However, they cannot capture the influence of exogenous variables on the behaviour of a time series. As a consequence, a VLMC adjusted on sequences influenced by covariates could fail to capture interesting patterns. ### A typical example Let us focus again on the electrical usage data set used in the package introduction, using the following two weeks of data: ```{r active_power_week_15_16, fig.height=5} pc_week_15_16 <- powerconsumption[powerconsumption$week %in% c(15, 16), ] elec <- pc_week_15_16$active_power ggplot(pc_week_15_16, aes(x = date_time, y = active_power)) + geom_line() + xlab("Date") + ylab("Activer power (kW)") ``` We build again a discrete time series using the following thresholds: - low active power at night (typically below 0.4 kW); - standard use between 0.4 and 2 kW; - peak use above 2 kW. ```{r} elec_dts <- cut(elec, breaks = c(0, 0.4, 2, 8), labels = c("low", "typical", "high")) ``` Adjusting a VLMC to this sequence gives the following model (using BIC for model selection): ```{r} elec_vlmc_tune <- tune_vlmc(elec_dts) best_elec_vlmc <- as_vlmc(elec_vlmc_tune) draw(best_elec_vlmc) ``` Considering the apparent regularity of the time series, one may wonder whether this model is really capturing the dynamics of power electricity consumption. In particular the longest memory of 3 time steps, i.e., 30 minutes, may seem a bit short compared to the somewhat long periods of stability. ## Theoretical aspects VLMC with covariates have been introduced in [Variable length Markov chain with exogenous covariates](https://doi.org/10.1111/jtsa.12615). The core idea is to enable the conditional probabilities of the next state given the context to depend on exogenous covariates. ### Variable memory A COVLMC models a pair of sequences. The main sequence is denoted $\mathbf{X}=X_1, X_2, \ldots, X_n, \ldots$. The random variables take values in a finite state space $S$ exactly as in a standard VLMC. We have *in addition* another sequence of random variables $\mathbf{Y}=Y_1, Y_2, \ldots, Y_n, \ldots$ which take values in $\mathbb{R}^p$. This latter assumption $Y_l\in \mathbb{R}^p$ is can be relaxed to more general spaces. A COVLMC puts restriction on the conditional probabilities of $X_n$ given the past of *both* sequences. To simplify the presentation, we denote $X^k_m$ the sequence of random variables $$ X_k^m=(X_k, X_{k-1}, \ldots, X_m), $$ and we use similar notations for the values taken by the variables $x^k_m$, as well as for $Y^k_m$ and $y^k_m$. The notation accommodates both temporal order if $km$. 
## Theoretical aspects

VLMC with covariates have been introduced in [Variable length Markov
chain with exogenous covariates](https://doi.org/10.1111/jtsa.12615).
The core idea is to enable the conditional probabilities of the next
state given the context to depend on exogenous covariates.

### Variable memory

A COVLMC models a pair of sequences. The main sequence is denoted
$\mathbf{X}=X_1, X_2, \ldots, X_n, \ldots$. The random variables take
values in a finite state space $S$, exactly as in a standard VLMC. We
have *in addition* another sequence of random variables
$\mathbf{Y}=Y_1, Y_2, \ldots, Y_n, \ldots$ which take values in
$\mathbb{R}^p$. This latter assumption, $Y_l\in \mathbb{R}^p$, can be
relaxed to more general spaces.

A COVLMC puts restrictions on the conditional probabilities of $X_n$
given the past of *both* sequences. To simplify the presentation, we
denote $X^k_m$ the sequence of random variables
$$
X^k_m=(X_k, X_{k-1}, \ldots, X_m),
$$
and we use similar notations for the values taken by the variables
$x^k_m$, as well as for $Y^k_m$ and $y^k_m$. The notation accommodates
both temporal order if $k<m$ and reverse temporal order if $k>m$.

A pair of sequences $\mathbf{X}$ and $\mathbf{Y}$ is a COVLMC if there
is a maximal order $l_{\max}$ and a function $l$ from $S^{l_{\max}}$ to
$\{0,\ldots,l_{\max}\}$ such that for all $n>l_{\max}$
$$
\begin{multline}
\mathbb{P}(X_n=x_n\mid X^{n-1}_1=x^{n-1}_1, Y^{n-1}_1=y^{n-1}_1)=\\
\mathbb{P}\left(X_n=x_n\mid X^{n-1}_{n-l\left(x^{n-1}_{n-l_{\max}}\right)}=x^{n-1}_{n-l\left(x^{n-1}_{n-l_{\max}}\right)}, Y^{n-1}_{n-l\left(x^{n-1}_{n-l_{\max}}\right)}=y^{n-1}_{n-l\left(x^{n-1}_{n-l_{\max}}\right)}\right).
\end{multline}
$$

As in a VLMC, the $\mathbf{X}$ process has a finite and variable
memory, but this memory applies to both $\mathbf{X}$ and $\mathbf{Y}$.
Notice that the memory order depends only on $\mathbf{X}$ and that no
assumptions are made on the temporal behaviour of $\mathbf{Y}$. Thus a
COVLMC on $\mathbf{X}$ and $\mathbf{Y}$ is not a VLMC on the pair
$(\mathbf{X}, \mathbf{Y})$ (which can make sense if $\mathbf{Y}$ is
discrete).

As for VLMC, the memory length function generates a *context* function
$c$ which keeps from the past the part needed to obtain the conditional
distribution: $c$ is a function from $S^{l_{\max}}$ to
$\bigcup_{k=0}^{l_{\max}}S^k$ given by
$$
c(x^{n-1}_{n-l_{\max}})=x^{n-1}_{n-l\left(x^{n-1}_{n-l_{\max}}\right)}.
$$
The image of $S^{l_{\max}}$ by $c$ is the set of contexts of the
COVLMC, which is entirely specified by $l$ and one conditional
distribution per unique context.

As the notations are somewhat opaque at first, we can illustrate the
definition with a simple example. We consider a binary sequence (values
in $S=\{0, 1\}$) and a single numerical covariate
$Y_t\in\mathbb{R}$. As in the theoretical example in
`vignette("variable-length-markov-chains")`, we assume that
$l_{\max}=3$ and that $l$ is given by
$$
\begin{align*}
l(a, b, 0)&=1&\forall a, \forall b,\\
l(c, 1, 1)&=2&\forall c,\\
l(0, 0, 1)&=3,&\\
l(1, 0, 1)&=3.&\\
\end{align*}
$$
In practice, the COVLMC is therefore fully described by specifying the
following probabilities:
$$
\begin{align*}
\mathbb{P}(X_t=1\mid& X_{t-1}=0, Y_{t-1}=y_{t-1})\\
\mathbb{P}(X_t=1\mid& X_{t-1}=1, X_{t-2}=1, Y_{t-1}=y_{t-1}, Y_{t-2}=y_{t-2})\\
\mathbb{P}(X_t=1\mid& X_{t-1}=1, X_{t-2}=0, X_{t-3}=0, Y_{t-1}=y_{t-1}, Y_{t-2}=y_{t-2}, Y_{t-3}=y_{t-3})\\
\mathbb{P}(X_t=1\mid& X_{t-1}=1, X_{t-2}=0, X_{t-3}=1, Y_{t-1}=y_{t-1}, Y_{t-2}=y_{t-2}, Y_{t-3}=y_{t-3})
\end{align*}
$$

### Conditional distributions

The main difficulty induced by COVLMC compared to VLMC is the
specification of the conditional distributions. Indeed, the conditional
distributions depend on $\mathbf{Y}$ and cannot simply be given by a
table. For instance, in the example above, we need to specify, among
others, $\mathbb{P}(X_t=1\mid X_{t-1}=0, Y_{t-1}=y_{t-1})$ for all
values of $y_{t-1}\in \mathbb{R}$ (in $\mathbb{R}^p$ in the general
case). A natural choice in this particular example is to use a logistic
model, i.e. to assume
$$
\mathbb{P}(X_t=1\mid X_{t-1}=0, Y_{t-1}=y_{t-1})=g(\alpha^0+y_{t-1}\beta_1^0),
$$
where $g(t)=\frac{1}{1+\exp(-t)}$ is the logistic function. The
superscript on $\alpha^0$ and $\beta^0_1$ refers to the context, here
$0$, while the subscript on $\beta^0_1$ refers to the time delay (here
$1$). By extension, we have for example
$$
\begin{multline*}
\mathbb{P}(X_t=1\mid X_{t-1}=1, X_{t-2}=1, Y_{t-1}=y_{t-1}, Y_{t-2}=y_{t-2})=\\
g\left(\alpha^{1,1}+y_{t-1}\beta^{1,1}_1+y_{t-2}\beta^{1,1}_2\right).
\end{multline*}
$$
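To make this parametric form concrete, here is a small sketch that
evaluates such a logistic conditional probability in R. The parameter
values are made up for illustration and are not estimated from any
data.

```{r}
# Logistic function g(t) = 1 / (1 + exp(-t))
g <- function(t) 1 / (1 + exp(-t))

# Made-up (purely illustrative) parameters for the context "0"
alpha_0 <- -0.5 # intercept
beta_0_1 <- 1.2 # coefficient of the covariate at delay 1

# P(X_t = 1 | X_{t-1} = 0, Y_{t-1} = y) over a grid of covariate values
y <- seq(-2, 2, by = 0.5)
round(g(alpha_0 + beta_0_1 * y), 3)
```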
More generally, the probability distribution associated with a context
could be given by any function that maps the values of the covariates
to a distribution on $S$, the state space. Following the original
[paper](https://doi.org/10.1111/jtsa.12615), `mixvlmc` uses multinomial
logistic regression as implemented by `VGAM::vglm()` or
`nnet::multinom()`, or a logistic regression provided by `stats::glm()`
for state spaces with only 2 states. This has several advantages over a
more general solution:

1.  during the estimation phase, one probability distribution is
    estimated for each relevant context: this could induce a large
    computational burden for models more complex than multinomial
    logistic ones;
2.  having a simple model makes it possible to fit contexts with a
    limited number of occurrences, which is important to allow
    searching for long term dependencies;
3.  logistic models with different memory orders are easy to compare
    using a likelihood-ratio test.

The last point is used in `mixvlmc` (as in the original paper) to
simplify local models with respect to the covariates. For a context of
length $l$, the probability distribution is assumed to depend on the
$l$ past values of $\mathbf{Y}$, but in practice we allow a dependency
on only $k<l$ past values when the reduced model is not significantly
worse than the full one, as assessed by a likelihood-ratio test.

## COVLMC in practice

As for VLMC, the recommended way to build a COVLMC is to use
`tune_covlmc()`, which selects the best model according to the BIC by
varying the significance threshold used in the pruning phase. A natural
covariate for the power consumption series is the time of the day: we
build a simple binary covariate which is `TRUE` for hours between 7 and
18 (included), i.e., during day time,

```{r}
elec_cov <- data.frame(day = (pc_week_15_16$hour >= 7 & pc_week_15_16$hour <= 18))
```

and fit a COVLMC with

```{r}
elec_tune <- tune_covlmc(elec_dts, elec_cov)
elec_tune
```

The model selection process can be represented graphically as follows:

```{r fig.height=4}
ggplot(elec_tune$results, aes(x = alpha, y = BIC)) +
  geom_line() +
  geom_point()
```

or automatically using e.g. `autoplot()` as follows:

```{r}
print(autoplot(elec_tune))
```

The final model is:

```{r}
draw(as_covlmc(elec_tune), model = "full", p_value = FALSE, with_state = TRUE)
```

It shows some interesting patterns:

-   one of the logistic models is degenerate: in the `high` context, no
    transition to the `low` state can happen. This was already observed
    with the VLMC;
-   the only dependency on the covariate is in the context `typical,
    typical, typical`. In this case, the probability of staying in the
    `typical` state is increased during day time and decreased (but to
    a lesser extent) during the night. In practice, this means that the
    model is able to generate longer sequences that stay in the
    `typical` state during the day than at night.

Notice finally that for this real world example, the `min_size`
parameter has some influence on the results. Setting it to a smaller
value does not change the final model, as shown below:

```{r}
elec_tune_3 <- tune_covlmc(elec_dts, elec_cov, min_size = 3)
elec_tune_3
```

However, increasing the parameter to, e.g., 10 generates a simpler but
weaker model, as shown here:

```{r}
elec_tune_10 <- tune_covlmc(elec_dts, elec_cov, min_size = 10)
elec_tune_10
```
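To make this comparison concrete, we can extract the three selected
models and put their sizes and BIC values side by side, using
`context_number()` and `BIC()`, both presented in the next section:

```{r}
# Compare size (number of contexts) and fit (BIC) of the models
# selected with the three min_size settings
models <- list(
  default = as_covlmc(elec_tune),
  min_size_3 = as_covlmc(elec_tune_3),
  min_size_10 = as_covlmc(elec_tune_10)
)
sapply(models, context_number)
sapply(models, BIC)
```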
### Diagnostics

The package provides numerous ways to analyse a COVLMC. Basic functions
include:

-   `states()` returns the state space of the model;
-   `depth()` returns the length of the longest context in the model;
-   `covariate_depth()` returns the longest memory used by the model
    with respect to the covariates;
-   `context_number()` returns the number of contexts in the model.

For instance, the large model obtained above has the following
characteristics:

```{r}
elec_model <- as_covlmc(elec_tune)
states(elec_model)
depth(elec_model)
covariate_depth(elec_model)
context_number(elec_model)
```

COVLMC objects support classical statistical functions such as:

```{r}
logLik(elec_model)
AIC(elec_model)
BIC(elec_model)
```

### Contexts

The model can be explored in detail by drawing its context tree (see
`vignette("context-trees")` for details) as follows:

```{r}
draw(elec_model)
```

The `draw.covlmc()` function is more advanced than its VLMC
counterpart. It provides more detailed information, particularly
regarding the p-values associated with pruning operations. The
"merging" p-value corresponds to replacing some of the contexts with a
single joint model, while the "collapsing" p-value is associated with
pruning all the sub-contexts of a context. P-values associated with
specific models correspond to reducing the memory of the corresponding
model, that is, discarding the dependency of the conditional
probability on the oldest covariates.

To explore the contexts in a programmatic way, one should rely on the
`contexts()` function. COVLMC contexts have additional characteristics
compared to VLMC and context tree ones. In particular, the `contexts()`
function can report the model associated with each context, either by
its parameters only:

```{r}
contexts(elec_model, model = "coef")
```

or using the models themselves:

```{r}
contexts(elec_model, model = "full")
```

As for context trees, a focused analysis of specific contexts can be
done by requesting a `ctx_node_covlmc` representation of the context of
interest, for instance via the `find_sequence.covlmc()` function, or
with the default result type of `contexts()`:

```{r}
ctxs <- contexts(elec_model)
ctxs
```

Individual contexts can be analysed using a collection of dedicated
functions, for instance:

```{r}
model(ctxs[[2]])
model(ctxs[[3]], type = "full")
```

See `contexts.covlmc()` for details.
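Finally, a fitted COVLMC can be used to sample new discrete sequences.
Because the transition probabilities depend on the covariates, values
of the covariates must be provided for the simulated period. The
minimal sketch below reuses the observed covariates and assumes they
are passed through the `covariate` argument of the `simulate()` method
(see `simulate.covlmc()` for the exact interface):

```{r}
# Sample a new series of the same length as the original one, reusing
# the observed covariates (the covariate argument is assumed by this
# sketch, see simulate.covlmc() for the exact interface)
set.seed(0)
elec_sim <- simulate(elec_model, nsim = length(elec_dts), covariate = elec_cov)
table(elec_sim)
```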