---
title: Bayesian Poisson Mixtures with the Telescoping Sampler
author: Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, Bettina Grün
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Bayesian Poisson Mixtures with the Telescoping Sampler}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: telescope.bib 
---

# Introduction

In this vignette we fit a Bayesian Poisson mixture with a prior on the number of components $K$ to a univariate data set of counts, the eye data. We use the prior specification and the telescoping sampler for performing MCMC sampling as proposed in @Fruehwirth-Schnatter+Malsiner-Walli+Gruen:2021.

First, we load the package.

```{r}
library("telescope")
```

# The eye data set

The eye data set was previously analyzed in @Escobar+West:1998 and @Fruehwirth+Malsiner-Walli:2019. We directly insert the values into R.

```{r}
y <- c(34, 24, 22, 17, 17, 15, 15, 14, 12, 12, 11, 11, 10, 10, 9, 9,
       8, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0)
N <- length(y)
```

The data set is visualized using a barplot. 

```{r fig.height = 5, fig.width = 7}
barplot(table(y), col = "lightgray", xlab = "count", ylab = "Frequency")   
```

# Model specification

For univariate observations  $y_1,\ldots,y_N$ the following model with hierarchical prior structure  is assumed:

\begin{aligned}
y_i \sim \sum_{k=1}^K \eta_k Pois(\mu_k)&\\
K \sim p(K)&\\
\boldsymbol{\eta} \sim Dir(e_0)& \qquad \text{with } e_0 \text{ fixed or } e_0\sim p(e_0) \text { or } e_0=\frac{\alpha}{K}, \text{ with }   \alpha \text{ fixed or } \alpha \sim p(\alpha)\\
\mu_k \sim \mathcal{G}(a_0,b_0)&  \qquad \text{with }E(\mu_k) = a_0/b_0,\\
b_0 \sim \mathcal{G}(h_0,H_0)& \qquad \text{with }E(b_0) = h_0/H_0.
\end{aligned}


# Specification of the simulation and prior parameters 

For MCMC sampling we need to specify `Mmax`, the maximum number of iterations, `thin`, the thinning imposed to reduce auto-correlation in the chain by only recording every `thin`ed observation, and `burnin`, the number of burn-in iterations not recorded.

```{r}
Mmax <- 20000
thin <- 1
burnin <- 1000
```

The specifications of `Mmax` and `thin` imply `M`, the number of recorded observations.

```{r}
M <- Mmax/thin
```

We specify with `Kmax` the maximum number of components possible during sampling. `Kinit` denotes the initial number of filled components. 

```{r}
Kmax <- 50  
Kinit <- 10
```

We fit a dynamic specification on the weights.
```{r}
G <- "MixDynamic"
```
For a static specific one would need to use `"MixStatic"`.

For the dynamic setup, we specify the the gamma prior $\mathcal{G}(1,2)$ on `alpha`, with $E(\alpha)=1/2$.
```{r}
priorOnAlpha <- priorOnAlpha_spec("gam_1_2")
```

We need to select the prior on `K`. We use the prior $K-1 \sim BNB(1, 4, 3)$ as suggested in @Fruehwirth-Schnatter+Malsiner-Walli+Gruen:2021 for a model-based clustering context.

```{r}
priorOnK <- priorOnK_spec("BNB_143")
```
  
Now, we specify the hyperparameters of the  prior on the component-specific rate $\mu_k$.
```{r}
a0 <- 0.1 
h0 <- 0.5 
b0 <- a0/mean(y) 
H0 <- h0/b0
```

We need initial parameters and an initial classification $S_0$ of the observations. 

```{r}
set.seed(1234) 
cl_y <- kmeans(y, centers = Kinit, nstart = 30)
S_0 <- cl_y$cluster
mu_0 <- t(cl_y$centers)
eta_0 <- rep(1/Kinit, Kinit)
```

# MCMC sampling

Using this prior specification as well as initialization and MCMC settings, we draw samples from the posterior using the telescoping sampler. 

The first argument of the sampling function is the data followed by the initial partition and the initial parameter values for component-specific rate and weights. The next argument corresponds to the hyperparameter  of the prior setup (`a0`, `b0`, `h0`, `H0`). Then the setting for the MCMC sampling are specified using `M`, `burnin`, `thin` and `Kmax`. Finally the prior specification for the weights and the prior on the number of components are given (`G`, `priorOnK`, `priorOnAlpha`).

```{r}
estGibbs <- samplePoisMixture(
    y, S_0, mu_0, eta_0, 
    a0, b0, h0, H0,
    M, burnin, thin, Kmax, 
    G, priorOnK, priorOnAlpha)
```

The sampling function returns a named list where the sampled parameters and latent variables are contained. The list includes the component rates `Mu`, the weights `Eta`, the assignments `S`, the number of observations  `Nk` assigned to components, the number of components `K`, the number of filled components `Kplus`, parameter values corresponding to the mode of the nonnormalized posterior `nonnormpost_mode`, the acceptance rate in the MH step when sampling $\alpha$ or $e_0$,  $\alpha$ and $e_0$. These values can be extracted for post-processing.

```{r}
Mu <- estGibbs$Mu
Eta <- estGibbs$Eta
S <- estGibbs$S
Nk <- estGibbs$Nk
K <- estGibbs$K
Kplus <- estGibbs$Kplus
nonnormpost_mode <- estGibbs$nonnormpost_mode
acc <- estGibbs$acc
e0 <- estGibbs$e0
alpha <- estGibbs$alpha 
```

# Convergence diagnostics of the run

We inspect the acceptance rate when sampling $\alpha$ and the trace plot of the sampled $\alpha$:
```{r fig.height = 5, fig.width = 7}
sum(acc)/M
plot(1:length(alpha), alpha,
     type = "l", ylim = c(0, max(alpha)),
     xlab = "iterations", ylab = expression(alpha))
```

In addition a histogram is also created. 
```{r fig.height = 5, fig.width = 7}
hist(alpha, freq = FALSE, breaks = 50)
```

We also characterize the distribution of $\alpha$ using location metrics and quantiles:
```r{}
mean(alpha)
median(alpha)
quantile(alpha, probs = c(0.25, 0.5, 0.75))
```

The trace plot of the induced hyperparameter $e_0$ is also visualized:
```{r fig.height = 5, fig.width = 7}
plot(1:length(e0), e0,
     type = "l", ylim = c(0, max(e0)),
     xlab = "iterations", ylab = expression(e["0"]))
hist(e0, freq = FALSE, breaks = 50)
mean(e0)
quantile(e0, probs = c(0.25, 0.5, 0.75))
```

To further assess convergence, we also inspect the trace plots of the number of components $K$ and the number of filled components $K_+$.
```{r fig.height = 5, fig.width = 7}
plot(1:length(K), K, type = "l", ylim = c(0, 30), 
     xlab = "iteration", main = "", ylab = "number",
     lwd = 0.5, col = "grey")
points(1:length(K), Kplus, type = "l", col = "red3",
       lwd = 2, lty = 1)
legend("topright", legend = c("K", expression(K["+"])),
       col = c("grey", "red3"), lwd = 2)
```

# Identification of the mixture model 

Clustering the data is not the main purpose for the eye data set, but rather capturing the heterogeneity in the rate parameter through a mixture. Aiming to identify the mixture model by determining a suitable number of data clusters and obtaining an identified model for this number using the clustering procedure in the point process representation fails indicating that the fitted model does not represent a suitable model for clustering. 

## Step 1: Estimate $K_+$ and $K$

We determine the posterior distribution of the number of filled components $K_+$ approximated using the telescoping sampler.
We visualize the distribution using a barplot.

```{r fig.height = 5, fig.width = 7}
Kplus <- rowSums(Nk != 0, na.rm = TRUE)  
p_Kplus <- tabulate(Kplus, nbins = max(Kplus))/M  
barplot(p_Kplus/sum(p_Kplus), names = 1:length(p_Kplus), 
        col = "red3", xlab = expression(K["+"]),
        ylab = expression("p(" ~ K["+"] ~ "|" ~ bold(y) ~ ")"))
```

The distribution of $K_+$ is also characterized using the 1st and 3rd quartile as well as the median.

```{r}
quantile(Kplus, probs = c(0.25, 0.5, 0.75))
```

We obtain a point estimate for $K_+$ by taking the mode and determine the number of MCMC draws where exactly $K_+$ components were filled.

```{r}
Kplus_hat <- which.max(p_Kplus)
Kplus_hat
M0 <- sum(Kplus == Kplus_hat)
M0          
```
We also determine the posterior distribution of the number of components $K$ directly drawn using the telescoping sampler.

```{r}
p_K <- tabulate(K, nbins = max(K, na.rm = TRUE))/M
round(p_K[1:20], digits = 2)
```

```{r fig.height = 5, fig.width = 7}
barplot(p_K/sum(p_K), names = 1:length(p_K), xlab = "K", 
        ylab = expression("p(" ~ K ~ "|" ~ bold(y) ~ ")"))
```

Again the posterior mode can be determined as well as the posterior
mean and quantiles of the posterior.

```{r}
which.max(tabulate(K, nbins = max(K)))
mean(K)
quantile(K, probs = c(0.25, 0.5, 0.75))
```

## Step 2: Extract those draws which correspond to samples with exactly $\hat{K}_+$ filled components

First, we select those draws where the number of non-empty groups was exactly $\hat{K}_+$.

```{r}
index <- Kplus == Kplus_hat
Nk[is.na(Nk)] <- 0
Nk_Kplus <- (Nk[index, ] > 0)
```

In the following we extract the cluster-specific rates, the data cluster sizes and the cluster assignments for the draws where exactly $\hat{K}_+$ components were filled.
```{r}
Mu_inter <- Mu[index, , , drop = FALSE]
Mu_Kplus <- array(0, dim = c(M0, 1, Kplus_hat)) 
Mu_Kplus[, 1, ] <- Mu_inter[, 1, ][Nk_Kplus]

Eta_inter <- Eta[index, ]
Eta_Kplus <- matrix(Eta_inter[Nk_Kplus], ncol = Kplus_hat)
Eta_Kplus <- sweep(Eta_Kplus, 1, rowSums(Eta_Kplus), "/")

w <- which(index)
S_Kplus <- matrix(0, M0, length(y))
for (i in seq_along(w)) {
    m <- w[i]
    perm_S <- rep(0, Kmax)
    perm_S[Nk[m, ] != 0] <- 1:Kplus_hat
    S_Kplus[i, ] <- perm_S[S[m, ]]
}
```

## Step 3: Clustering and relabeling of the MCMC draws, with $\hat{K}_+$ being the estimated number of data clusters

For model identification, we cluster the draws of the rates where exactly $\hat{K}_+$ components were filled in the point process representation using $k$-means clustering. 
```{r}
Func_init <- matrix(nonnormpost_mode[[Kplus_hat]]$mean_muk,
                    nrow = Kplus_hat)
identified_Kplus <- identifyMixture(
    Mu_Kplus, Mu_Kplus, Eta_Kplus, S_Kplus, Func_init)
identified_Kplus$non_perm_rate
```

The non-permutation rate is 1. This means that in no iteration  a unique assignment of draws  to components could be made, indicating a strong overfit of the number of clusters.

## References