--- title: "Exact distribution of excursions height" author: "Sébastien Déjean - Charly Marti - Sabine Mercier - Sebastian Simon - David Robelin" date: "`r Sys.Date()`" output: pdf_document: toc: yes html_document: df_print: paged toc: yes vignette: > %\VignetteIndexEntry{Exact distribution of excursions height} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r, include = FALSE} knitr::opts_chunk$set( warning = FALSE, collapse = TRUE, comment = "#>" ) ``` ```{r setup, echo=FALSE} library(localScore) ``` # Introduction The goal of this part of the package is to calculate the theoretical probability that the $i$-th excursion reach the threshold score $a$, for a Markov chain with a known transition matrix and a given score function. The main function for this objective is `proba_theoretical_ith_excursion_markov()`. This section of the localScore package is the direct implementation of the pre publication: "Exact distribution of excursion heights of the Lindley process in a Markovian model" written by Carlos Cortés Rojas, Simona Grusea and Sabine Mercier. ## Computation method The main goal of this function is to calculate the probability that the first excursion of the Lindley sequence associated to the studied sequence is greater or equal to $a$, conditionally to $\alpha$ a potential beginning of this sequence. It also computes the transition probability matrix of the beginnings of excursions: `matrix_M`. The product of this matrix, elevated to the power $i-1$, with the first vector of the probabilities conditionally to $\alpha$, gives the probabilities conditionally to $\alpha$ for the $i$-th excursion. To return the global probability, the function multiplies the conditional vector with the distribution of the first letter of the sequence. ## Definition of a mathematical excursion The number of an excursion is given by the mathematical definition of Karlin and Altschul (1990). Its corresponds to the number of record times of the Lindley sequence associated to the score sequence $(X_k)_{1\leq k\leq n}$ and recursively defined as follows: $$T_0:=0\qquad \mbox{and }\qquad T_{(k+1)}:=\inf\{i\geq T_k+1, \sum_{j=T_k+1}^i X_j<0\}\ .$$ Please note that the sum $\sum_{j=T_k+1}^i X_j$ must be non positive. If it is equal to zero, it will mathematically be still the same Lindley excursion. $$(A_k)_k\quad \mbox{with}\quad T_{(i-1)}+1\leq k 0 ``` Let us consider a study on the first and the last sub optimal excursions in the sequential order of the sequence. First excursion height equal to 40; last one, the 77th one, height equal to 6. ```{r} subOptSegment["1",] #First excursion subOptSegment[as.character(dim(subOptSegment)[1]),] #Last excursion ``` Let us compute their $p$-values. ```{r} theta <- letters[1:length(score_values)] # arbitrary score_function <- score_values # defined earlier a <- 40 i <- 1 system.time(pv2<-proba_theoretical_ith_excursion_markov(a, theta, lambda, score_function,i)$proba_q_i_geq_a) pv2 a <- 6 i <- 77 system.time(pv3<-proba_theoretical_ith_excursion_markov(a, theta, lambda, score_function, i)$proba_q_i_geq_a) pv3 ``` First excursion : height equal to 40 with $p$-value=$0.45\%$; Last excursion, the 77th one, height equal to 6 and $p$-value=$17\%$. The time computation of the first $p$-value is larger because of the larger value of $a$. We can consider that from the 20th mountain we reached the stationary distribution for the beginning of excursion. We shall therefore take a lower value for $i$ for the last excursion to evaluate the difference. Let us try $i=20$ instead of 77. ```{r} a <- 6 i <- 20 system.time(pv4 <- proba_theoretical_ith_excursion_markov(a, theta, lambda, score_function,i)$proba_q_i_geq_a) pv4 i <- 10 system.time(pv5 <- proba_theoretical_ith_excursion_markov(a, theta, lambda, score_function,i)$proba_q_i_geq_a) pv5 ``` We obtain the same value as expected even for $i=10$. ## With a reverse lecture of the protein As the lecture of a protein could be done in both direction, we also consider the reverse sequence. ```{r} LongSeqScore.inv <- rev(LongSeqScore) localScoreC(LongSeqScore.inv) head(sort(localScoreC(LongSeqScore.inv)$suboptimalSegmentScores[,1], decreasing = TRUE)) plot(1:1093,lindley(LongSeqScore.inv), type = 'l') ``` That leads to: first excursion equal to 6; last excursion, the 96th one, equal to 38. The corresponding $p$-values are: ```{r} markov_parameters <- sequences2transmatrix(LongSeqScore.inv) lambda.inv <- markov_parameters$transition_matrix system.time(pv5<-proba_theoretical_ith_excursion_markov(a = 38, theta, lambda.inv, score_function, i = 96 )$proba_q_i_geq_a) pv5 proba_theoretical_ith_excursion_markov(a = 6, theta, lambda.inv, score_function,i = 1 )$proba_q_i_geq_a ``` Last excursion, the 96th one, equal to 38, with $p$-value equal to $0.53\%$. First excursion equal to 6, with $p$-value equal to $17\%$. The last excursion is still significant. The first one is still non significant. Remark : Even with a Bonferroni correction to take into account the multiple test (here two studies excursions), the sequence possesses a significant segment. Whereas considering the highest value over all the excursions of the whole sequence, the local score value, the sequence is not significant. ## What about the excursion realising the local score Let us consider the excursion 30 as an excursion among the others. For the first way to read the protein: ```{r} a <- 65 i <- 30 proba_theoretical_ith_excursion_markov(a, theta, lambda, score_function, i)$proba_q_i_geq_a ``` There is a less than 4 in 10,000 chance of having a first mountain over 65. Or with the second way to read the protein: ```{r} subOptSegment.inv <- localScoreC(LongSeqScore.inv)$suboptimalSegmentScores which.max(subOptSegment.inv[,1]) print(subOptSegment.inv[16,]) a <- 65 i <- 16 proba_theoretical_ith_excursion_markov(a, theta, lambda.inv, score_function,i)$proba_q_i_geq_a ``` As expected we obtain a similar $p$-value. Although the highest mountain of height 65 in this sequence of length 1093 does not have a significant value at the $5\%$ threshold, it does not mean that the sequence has not interesting segments. Observing a mountain exceeding 40 is significant for example, and it is the same for values 36 and 33. All these tests remain significant even if a correction of type Bonferroni is taken.