In this vignette we present additional features of the
sfclust
package.
We begin by loading the required packages.
#> Loading required package: abind
#> Loading required package: sf
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
The simulated dataset used in this vignette, stgaus
, is
included in our package. It is a stars
object with one
variable, y
, and two dimensions: geometry
and
time
. The dataset represents the number of cases in 100
regions, observed daily over 91 days, starting in January 2024.
#> stars object with 2 dimensions and 1 attribute
#> attribute(s):
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> y -1.170275 -0.2390084 0.0302046 -0.0004274323 0.2095926 1.370593
#> dimension(s):
#> from to offset delta refsys point
#> geometry 1 100 NA NA NA FALSE
#> time 1 91 2024-01-01 1 days Date FALSE
#> values
#> geometry POLYGON ((59.5033 683.285...,...,POLYGON ((942.7562 116.89...
#> time NULL
To gain some initial insights, we can aggregate the data weekly:
stweekly <- aggregate(stgaus, by = "week", FUN = mean)
ggplot() +
geom_stars(aes(fill = y), stweekly) +
facet_wrap(~ time) +
scale_fill_distiller(palette = "RdBu") +
theme_bw()
We can also examine the trends for each region:
stgaus |>
st_set_dimensions("geometry", values = 1:nrow(stgaus)) |>
as_tibble() |>
ggplot() +
geom_line(aes(time, y, group = geometry, color = factor(geometry)), linewidth = 0.3) +
theme_bw() +
theme(legend.position = "none")
Some regions exhibit similar trends over time, but the overall patterns are more complex than polinomial functions. Our goal is to cluster these regions while accounting for spatial contiguity.
We assume that the response variable \(Y_{it}\) for region \(i\) at time \(t\) follows a normal distribution given the partition \(M\): \[ Y_{it} \mid \mu_{it}, M \stackrel{ind}{\sim} \text{Normal}(\mu_{it}, \sigma^2), \] where the mean \(\mu_{it}\) is modeled based on the cluster assignment \(c_i\): \[ \mu_{it} = \alpha_{c_i} + f_{c_i,t}, \] where \(\alpha_{c_i}\) is the intercept for cluster \(c_i\), and \(f_{c_i,t}\) is a latent random effect modeled as a random walk process. Specifically, we impose the condition: \[ f_{c_i,t} - f_{c_i,t-1} \sim N(0, \nu^{-1}), \quad \text{for } t = 2, \dots, n. \]
The prior for the hyperparameter \(\nu_c\) is defined as: \[ \log(\nu_c) \sim \text{Normal}(-2, 1). \]
sfclust
uses an undirected graph to represent
connections between regions and proposes spatial clusters using this
graph through minimum spanning trees (MST). By default,
sfclust
accepts the argument graphdata
, which
should include:
igraph
object representing spatial
connections,membership
vector indicating the initial cluster
assignments for each region.For simplicity, you can use the genclust
function to
generate an initial random partitioning with a specified number of
clusters. In this example, we create a partition with 20 clusters:
#> [1] "graph" "mst" "membership"
Now, let’s visualize how the regions were randomly clustered:
st_sf(st_geometry(stgaus), cluster = factor(initial_cluster$membership)) |>
ggplot() +
geom_sf(aes(fill = cluster)) +
theme_bw()
sfclust
We will initiate the Bayesian clustering algorithm using our
generated partition (initial_cluster
). Additionally, we use
the capabilites of INLA::inla
to define a model that
includes a random walk process in the linear predictor, and custom
priors for the scale hyperparameter. To begin, we will run the algorithm
for only 50 iterations.
result <- sfclust(stgaus, graphdata = initial_cluster, niter = 50,
formula = y ~ f(idt, model = "rw1", hyper = list(prec = list(prior = "normal", param = c(-2, 1)))))
result
#> Within-cluster formula:
#> y ~ f(idt, model = "rw1", hyper = list(prec = list(prior = "normal",
#> param = c(-2, 1))))
#>
#> Clustering hyperparameters:
#> q birth death change hyper
#> 0.500 0.425 0.425 0.100 0.050
#>
#> Clustering movement counts:
#> births deaths changes hypers
#> 11 6 3 3
#>
#> Log marginal likelihood (sample 50 out of 50): 2268.291
The output indicates that after starting with 20 clusters, the algorithm created 11 new clusters (births), removed 6 clusters (deaths), changed the membership of 3 clusters, and modified the minimum spanning tree (MST) 3 times.
The plot
method allows us to select which graph to
produce. For example, we can visualize only the log marginal likelihood
to diagnose convergence. With just 50 iterations, we can see that the
log marginal likelihood has not yet achieved convergence.
To achieve convergence, we can continue the sampling using the
update
function and specify the number of new iterations
with the niter
argument:
#> Within-cluster formula:
#> y ~ f(idt, model = "rw1", hyper = list(prec = list(prior = "normal",
#> param = c(-2, 1))))
#>
#> Clustering hyperparameters:
#> q birth death change hyper
#> 0.500 0.425 0.425 0.100 0.050
#>
#> Clustering movement counts:
#> births deaths changes hypers
#> 35 50 9 88
#>
#> Log marginal likelihood (sample 2000 out of 2000): 12708.35
With 2000 additional iterations, there have been many clustering movements. Furthermore, when visualizing the results, we can observe that the log marginal likelihood has achieved convergence.
The final iteration indicates that the algorithm identified 10 clusters, with a log marginal likelihood of 12,708.35. The largest cluster consists of 27 regions, while the smallest contains only one region.
#> Summary for clustering sample 2000 out of 2000
#>
#> Within-cluster formula:
#> y ~ f(idt, model = "rw1", hyper = list(prec = list(prior = "normal",
#> param = c(-2, 1))))
#>
#> Counts per cluster:
#> 1 2 3 4 5 6 7 8 9 10
#> 27 20 12 11 9 9 6 3 2 1
#>
#> Log marginal likelihood: 12708.35
Let’s visualize the regions grouped by cluster.
stgaus$cluster <- fitted(result2, sort = TRUE)$cluster
st_sf(st_geometry(stgaus), cluster = factor(stgaus$cluster)) |>
ggplot() +
geom_sf(aes(fill = cluster)) +
theme_bw()
#> Warning in data.frame(..., check.names = FALSE): row names were found from a
#> short variable and have been discarded
Let’s visualize the original data grouped by cluster.
stgaus$cluster <- fitted(result2, sort = TRUE)$cluster
stgaus |>
st_set_dimensions("geometry", values = 1:nrow(stgaus)) |>
as_tibble() |>
ggplot() +
geom_line(aes(time, y, group = geometry), linewidth = 0.3) +
facet_wrap(~ cluster, ncol = 2) +
theme_bw() +
labs(title = "Risk per cluster", y = "Risk", x = "Time")