`stgam`

The aim of this vignette is provide an introduction to the
`stgam`

package and demonstrates how to construct a spatially
vary coefficient (SVC) model and spatial and temporally varying (STVC)
model using the `stgam`

package. Essentially
`stgam`

provides a wrapper for constructing GAMs with
Gaussian Process (GP) smooths or splines, that are parameterised with
location for SVCs and with location and time for STVCs. The key ideas
underpinning the development of SVCs with GAMs in the `stgam`

package and this vignette are:

- Standard linear regression assumes that predictor-to-response relationships to be the same throughout the study area.
- This is often not the case when when location is considered, for
example of outliers.

- Many geographic processes have a Gaussian form when they are examined over 2-dimensional space, as they essentially exhibit distance decay.
- The GAMs can include smooths or splines of different forms. One such form is a Gaussian Process (GP) spline.
- GP splines can be specified to model non-linearity (wiggliness) over geographic space if location is included with the covariate.
- A GAM with GP splines parameterised by location - a Geographic GP GAM or GGP-GAM - defines a spatially varying coefficient (SVC) model.
- This can be extended to time and space-time.

The approach is presented in outline below, but detail of the SVCs,
TVCs and STVCs constructed using GAMs with GP smooths, and the evolution
of their application from spatial models to space-time coefficient
modelling can be found in Comber, Harris, and
Brunsdon (2024) and Comber et al.
(submitted). In this, the concept of a Gaussian Process (Wood 2020) is important in the context of
regression modelling. It provides a *data model* of the
likelihood that a given data set is generated given a statistical model
involving some unknown parameters and in regression modelling, the
unknown parameters are the regression parameters. These are described
formally in Comber, Harris, and Brunsdon
(2024) and Comber et al.
(submitted).

One final comment is that in this vignette you will all of the
varying coefficient models you create assume that spatial, temporal or
spatio-temporal dependencies are present in the data. This may not be
the case and these assumptions are examined in detail in the second
vignette in the `stgam`

package.

In this vignette you will:

- Create and interpret a simple SVC using GAMs with GP smooths.
- Create and interpret a simple TVC GAM.
- Extend this to a GP GAM that includes both space and time in smooths to define a STVC model.
- Be encourages to reflect on how space and time interact and the
assumptions embedded in the model specification. Be warned! the Vignette
only uses one function from the
`stgam`

package!

You should install the `stgam`

package either from CRAN or
from GitHub:

And then make sure the required packages and data are loaded:

```
# load the packages
library(stgam)
library(cols4all) # for nice shading in graphs and maps
library(cowplot) # for managing plots
library(dplyr) # for data manipulation
library(ggplot2) # for plotting and mapping
library(glue) # for model construction
library(mgcv) # for GAMs
library(sf) # for spatial data
library(doParallel) # for parallelising operations
library(purrr) # for model construction
library(tidyr) # for model construction
# load the data
data(productivity)
data(us_data)
```

The `productivity`

data is annual economic productivity
data for the 48 contiguous US states (with Washington DC merged into
Maryland), for years 1970 to 1985 (17 years). This was extracted from
the `plm`

package (Croissant, Millo,
and Tappe 2022). The `us_data`

is a spatial dataset of
the US states in a USA Contiguous Equidistant Conic projection
(ESRI:102005) from the `spData`

package (Bivand, Nowosad, and Lovelace 2019). The
`productivity`

data includes locational information of the
state geometric centroid in the same projection. The code below maps the
`X`

and `Y`

locations in `productivity`

along with the US state areas.

```
ggplot() + geom_sf(data = us_data, fill = NA) +
geom_point(data = productivity |> filter(year == "1970"), aes(x = X, y = Y)) +
theme_bw() + ylab("") + xlab("")
```

The data attributes can be examined and a spatial model of
constructed using `gam`

function from the `mgcv`

package. The varying coefficient models created below all focus on
Private capital stock (`privC`

) as the target variable, with
Unemployment (`unemp`

) and Public capital (`pubC`

)
covariates, where the coefficient functions are assumed to be
realisation of a Gaussian Process (GP) introduced above.

```
head(productivity)
#> # A tibble: 6 × 14
#> state GEOID year region pubC hwy water util privC gsp emp unemp X
#> <chr> <chr> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 ALABAMA 01 1970 6 15033. 7326. 1656. 6051. 35794. 28418 1010. 4.7 858146.
#> 2 ALABAMA 01 1971 6 15502. 7526. 1721. 6255. 37300. 29375 1022. 5.2 858146.
#> 3 ALABAMA 01 1972 6 15972. 7765. 1765. 6442. 38670. 31303 1072. 4.7 858146.
#> 4 ALABAMA 01 1973 6 16406. 7908. 1742. 6756. 40084. 33430 1136. 3.9 858146.
#> 5 ALABAMA 01 1974 6 16763. 8026. 1735. 7002. 42057. 33749 1170. 5.5 858146.
#> 6 ALABAMA 01 1975 6 17316. 8158. 1752. 7406. 43972. 33604 1155. 7.7 858146.
#> # ℹ 1 more variable: Y <dbl>
```

A spatially varying coefficient model will be created using the
`productivity`

dataset.

The code below defines an intercept column (`Intercept`

)
in the data. This to allow the intercept to be treated as an addressable
term in the model. It also defines parametric and non-parametric forms
for the intercept and each covariate, so that they can can take a global
form (i.e. as in a standard OLS regression) and a spatially varying form
in the GP smooth.

```
# define intercept term
productivity <- productivity |> mutate(Intercept = 1)
# create the SVC
svc.gam = gam(privC ~ 0 +
Intercept + s(X, Y, bs = 'gp', by = Intercept) +
unemp + s(X, Y, bs = "gp", by = unemp) +
pubC + s(X, Y, bs = "gp", by = pubC),
data = productivity |> filter(year == "1970"))
```

Notice the `0 +`

in the model. This indicates that the
intercept coefficient is not included implicitly and it is included
explicitly as `Intercept`

. Also notice the different form of
the splines from those specified in Part 1. Here, for each covariate, a
GP smooth is specified for `X`

and `Y`

(the
coordinates in geographic space) and the covariate is included via the
`by`

parameter. This is to explore the interaction of the
covariate with the target variable over the space defined by
`X`

and `Y`

locations, allowing spatially varying
coefficients to be generated. The model has 4 key terms specified in the
same way by
`<VAR> + (X, Y, b s= 'gp', by = <VAR>)`

:

- the
`<VAR>`

is the fixed parametric term for the covariate `s(...)`

defines the smooth`bs = 'gp'`

states that this is a GP smooth`by = <VAR>`

suggests the GP should be multiplied by variable

The model output can be assessed: the `k'`

and
`edf`

parameters are quite close, but the p-values are high
and and the `k-index`

is greater than 1 so this looks OK. The
diagnostic plots are again generated by the `gam.check`

function as below:

```
#>
#> Method: GCV Optimizer: magic
#> Smoothing parameter selection converged after 6 iterations by steepest
#> descent step failure.
#> The RMS GCV score gradient at convergence was 2452.878 .
#> The Hessian was not positive definite.
#> Model rank = 8 / 101
#>
#> Basis dimension (k) checking results. Low p-value (k-index<1) may
#> indicate that k is too low, especially if edf is close to k'.
#>
#> k' edf k-index p-value
#> s(X,Y):Intercept 32.00 2.00 0.82 0.065 .
#> s(X,Y):unemp 33.00 2.00 0.82 0.010 **
#> s(X,Y):pubC 33.00 3.67 0.82 0.035 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# model summary
summary(svc.gam)
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> privC ~ 0 + Intercept + s(X, Y, bs = "gp", by = Intercept) +
#> unemp + s(X, Y, bs = "gp", by = unemp) + pubC + s(X, Y, bs = "gp",
#> by = pubC)
#>
#> Parametric coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> Intercept 0.0014245 0.0003432 4.151 0.000168 ***
#> unemp 0.0064540 0.0015548 4.151 0.000168 ***
#> pubC -3.9243042 0.9571325 -4.100 0.000196 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df F p-value
#> s(X,Y):Intercept 2.000 2.000 0.453 0.639
#> s(X,Y):unemp 2.000 2.000 0.405 0.676
#> s(X,Y):pubC 3.672 3.715 188.593 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Rank: 8/101
#> R-sq.(adj) = 0.918 Deviance explained = 96.5%
#> GCV = 1.9338e+08 Scale est. = 1.613e+08 n = 48
```

Here it can be seen that:

- The model is well tuned: all of the all effective degrees of freedom
(
`edf`

are well below`k`

in the`gam.check()`

printed output (the`k-index<1`

issue is not important because of this). - All of the the fixed parametric terms are significant.
- Of the smooth terms, only
`pccap`

is locally significant and spatially varying.

The spatially varying coefficient estimates can be extracted using
`predict`

. To do this a dummy data set is created with the
`pubC`

term set to 1, and the intersect and
`unemp`

terms set to zero. The result is that the predicted
values for the coefficient estimate are just a function of \(\beta_2\), the `pubC`

coefficient estimate at each location.

```
get_b2<- productivity |> filter(year == "1970") |> mutate(Intercept = 0, unemp = 0, pubC = 1)
res <- productivity |> filter(year == "1970") |>
mutate(b2 = predict(svc.gam, newdata = get_b2))
```

The resulting `data.frame`

called `res`

has a
new variable called `b2`

which is the spatially varying
coefficient estimate for `pubC`

. For comparison, we can
generate the spatially varying coefficient estimate for the intercept
(\(\beta_0\)) and `unemp`

\(\beta_1\) (which were not found to be
significant locally) in the same way by setting the other terms in the
model to zero:

```
get_b0 <- productivity |> filter(year == "1970") |> mutate(Intercept = 1, unemp = 0, pubC = 0)
res <- res |> mutate(b0 = predict(svc.gam, newdata = get_b0))
get_b1 <- productivity |> filter(year == "1970") |> mutate(Intercept = 0, unemp = 1, pubC = 0)
res <- res |> mutate(b1 = predict(svc.gam, newdata = get_b1))
```

So `res`

has the records for the year 1970 and three new
columns for `b0`

, `b1`

and `b2`

The
distribution of the spatially varying coefficient estimates can be
examined:

```
res |> select(b0, b1, b2) |>
apply(2, summary) |> round(2)
#> b0 b1 b2
#> Min. -19058.71 -2848.15 -0.18
#> 1st Qu. -7882.46 -1199.14 1.29
#> Median 731.22 -156.64 2.11
#> Mean 0.00 0.30 2.08
#> 3rd Qu. 7891.22 1179.57 2.85
#> Max. 20699.80 3362.89 3.80
```

The `stgam`

package has a function to extract the varying
coefficients, `calculate_vcs`

. This takes three arguments,
the GAM varying coefficient model, the model terms, and the data used to
create the model:

```
terms = c("Intercept", "unemp", "pubC")
res <- calculate_vcs(model = svc.gam,
terms = terms,
input_data = productivity |> filter(year == "1970"))
summary(res[, paste0("b_",terms)])
#> b_Intercept b_unemp b_pubC
#> Min. :-19058.713 Min. :-2848.148 Min. :-0.1822
#> 1st Qu.: -7882.457 1st Qu.:-1199.142 1st Qu.: 1.2941
#> Median : 731.222 Median : -156.638 Median : 2.1100
#> Mean : 0.001 Mean : 0.302 Mean : 2.0825
#> 3rd Qu.: 7891.224 3rd Qu.: 1179.568 3rd Qu.: 2.8474
#> Max. : 20699.803 Max. : 3362.893 Max. : 3.8034
```

Standard `dplyr`

and `ggplot`

approaches can be
used to join and map the coefficient estimates, formally \(\beta_0\), \(\beta_1\) and \(\beta_2\)) as in the figure below. Notice
the North-South trend for the Intercept and the East-West and trend for
Unemployment - both insignificant predictors of `privC`

- and
a much stronger specific spatial pattern between the target variable and
Public capital (`pubC`

), with particularity high coefficient
estimates in the south.

```
# join the data
map_results <-
us_data |> left_join(res |> select(GEOID, b_Intercept, b_unemp, b_pubC), by = "GEOID")
# plot the insignificant coefficient estimates
tit =expression(paste(""*beta[0]*""))
p1 <-
ggplot(data = map_results, aes(fill=b_Intercept)) +
geom_sf() +
scale_fill_continuous_c4a_div(palette="brewer.spectral",name=tit) +
coord_sf() +
ggtitle("Intercept: not significant")
tit =expression(paste(""*beta[1]*""))
p2 <-
ggplot(data = map_results, aes(fill=b_unemp)) +
geom_sf() +
scale_fill_continuous_c4a_div(palette="brewer.spectral",name=tit) +
coord_sf() +
ggtitle("Unemployment: not significant")
# plot the significant pubC coefficient estimates
tit =expression(paste(""*beta[2]*" "))
p3 <-
ggplot(data = map_results, aes(fill=b_pubC)) +
geom_sf() +
scale_fill_continuous_c4a_div(palette="brewer.prgn",name=tit) +
coord_sf() +
ggtitle("Public captial: significant")
plot_grid(p1, p2, p3, ncol = 1)
```

The `productivity`

data was filtered for 1970 the the SVC
above in which the `X-Y`

location of the 48 states was used
to parametrise the Gaussian Process smooths. The same structure can be
used to create a temporally vary coefficient model (TVC), with smooths
specified to include the `year`

parameter, but this time not
restricting the analysis data to records from a single year:

```
# create the TVC
tvc.gam = gam(privC ~ 0 +
Intercept + s(year, bs = 'gp', by = Intercept) +
unemp + s(year, bs = "gp", by = unemp) +
pubC + s(year, bs = "gp", by = pubC),
data = productivity)
```

The model can be inspected in the same way to examine the the
`k'`

and `edf`

parameters using the
`gam.check`

function and again there are no concerns:

The model summary indicates that all of the parametric and temporally vary coefficients are significant at the 95% level except the parametric one for Unemployment, but there are some interesting patterns in the residuals, as reflected in the diagnostics plots above and the model fit (\(R^2\)) value.

The temporally varying coefficients can be extracted in the same way
as the DVC approach, explored using the `predict`

function
and setting each of the covariates to 1 and the others to zero in
turn:

```
terms = c("Intercept", "unemp", "pubC")
res <- calculate_vcs(model = tvc.gam,
terms = terms,
input_data = productivity)
summary(res[, paste0("b_",terms)])
#> b_Intercept b_unemp b_pubC
#> Min. :-1057 Min. :-3733.9 Min. :1.462
#> 1st Qu.: 7649 1st Qu.:-2342.3 1st Qu.:1.653
#> Median :16356 Median : -950.6 Median :1.844
#> Mean :16356 Mean : -950.6 Mean :1.844
#> 3rd Qu.:25063 3rd Qu.: 441.0 3rd Qu.:2.035
#> Max. :33769 Max. : 1832.6 Max. :2.226
```

The variation in coefficient estimates can be inspected over time, remembering that each State has the same coefficient estimate value for each year, so just one state is selected in the code below (you could chose another and the result would be the same). Notice the linear declines due to the linearity of the time covariate in the figure below.

```
res |>
filter(state == "ARIZONA") |>
select(year, b_Intercept, b_unemp, b_pubC) |>
pivot_longer(-year) |>
ggplot(aes(x = year, y = value)) +
geom_point() + geom_line() +
facet_wrap(~name, scale = "free") +
theme_bw() + xlab("Year") + ylab("")
```

We can combine space and time in GAM GP splines. But how? We could
use separate smooths for location and for time, or a single, 3D smooth
parameterised with both location and time. There are assumptions
associated with each of these. The code below specifies the interaction
of the covariates within a single space-time GP smooth. You will notice
this takes a few seconds longer to run, and choice of how to specify the
smooths is explored in the second vignette in the `stgam`

package.

```
stvc.gam = gam(privC ~ 0 +
Intercept + s(X, Y, year, bs = 'gp', by = Intercept) +
unemp + s(X, Y, year, bs = "gp", by = unemp) +
pubC + s(X, Y, year, bs = "gp", by = pubC),
data = productivity)
```

The model can be inspected in the usual way using the
`gam.check`

function (again, no concerns) and the
`summary`

function with \(k\) and \(edf\) despite the concerning
`k-index`

and `p-value`

values. In this case all
of the parametric and smooth coefficient estimates are significant at
the 95% level, and the model fit (\(R^2\)) has again increased over the SVC
model.

The coefficients spatial and temporally vary coefficients can be extracted in the same way as before and the variation in coefficient estimates from the STVC-GAM model summarised:

```
terms = c("Intercept", "unemp", "pubC")
res <- calculate_vcs(model = stvc.gam,
terms = terms,
input_data = productivity)
summary(res[, paste0("b_",terms)])
#> b_Intercept b_unemp b_pubC
#> Min. :-9767.27 Min. :-1581.313 Min. :0.009162
#> 1st Qu.:-3966.23 1st Qu.: -642.095 1st Qu.:1.619839
#> Median : -77.85 Median : -312.564 Median :2.298574
#> Mean : 0.00 Mean : -2.411 Mean :2.308127
#> 3rd Qu.: 3378.97 3rd Qu.: 629.192 3rd Qu.:3.027175
#> Max. :10806.31 Max. : 2195.418 Max. :4.396288
```

This indicates that a positive relationship between Private capital with with Public capital and mixed positive and negative one with Unemployment and the Intercept.

It is instructive to unpick some of the model coefficients in more detail and the code below summarises variations over time through the median values of each coefficient estimate:

```
res |>
select(year, b_Intercept, b_unemp, b_pubC) |>
group_by(year) |>
summarise(med_b0 = median(b_Intercept),
med_b1 = median(b_unemp),
med_b2 = median(b_pubC))
#> # A tibble: 17 × 4
#> year med_b0 med_b1 med_b2
#> <int> <dbl> <dbl> <dbl>
#> 1 1970 -77.8 -313. 1.95
#> 2 1971 -77.8 -313. 1.99
#> 3 1972 -77.8 -313. 2.03
#> 4 1973 -77.8 -313. 2.07
#> 5 1974 -77.8 -313. 2.11
#> 6 1975 -77.8 -313. 2.15
#> 7 1976 -77.8 -313. 2.19
#> 8 1977 -77.8 -313. 2.23
#> 9 1978 -77.8 -313. 2.27
#> 10 1979 -77.8 -313. 2.31
#> 11 1980 -77.8 -313. 2.35
#> 12 1981 -77.8 -313. 2.39
#> 13 1982 -77.8 -313. 2.43
#> 14 1983 -77.8 -313. 2.47
#> 15 1984 -77.8 -313. 2.51
#> 16 1985 -77.8 -313. 2.55
#> 17 1986 -77.8 -313. 2.59
```

It is evident that of the 2 covariates and the intercept used to
model Private capital, only Public capital (`b2`

) varies
(increases) over time. This increase is shown visually below .

```
# inputs to plot
res |> select(starts_with("b"), year) |>
mutate(year = "All Years") -> tmp
cols = c(c4a("tableau.red_gold", n = 17, reverse = T), "grey")
tit =expression(paste(""*beta[`Private Capital`]*""))
# plot
res |> select(starts_with("b"), year) |>
rbind(tmp) |>
mutate(year = factor(year)) |>
ggplot(aes(y = year, x = b_pubC, fill = year)) +
geom_boxplot(outlier.alpha = 0.1) +
scale_fill_manual(values=cols, guide = "none") +
theme_bw() + xlab(tit) + ylab("Time")
```

The spatial pattern of this temporal trend can also be explored as below. This shows that the increasing intensity of the effect of Public capital on Private capital does not vary spatially: the increase in effect is spatially even, with high values in the south.

```
tit =expression(paste(""*beta[`Public Capital`]*""))
# join the data
map_results <-
us_data |> left_join(res |> select(GEOID, year, b_Intercept, b_unemp, b_pubC), by = "GEOID")
# create the plot
map_results |>
ggplot() + geom_sf(aes(fill = b_pubC), col = NA) +
scale_fill_binned_c4a_seq(palette="scico.lajolla", name = tit) +
facet_wrap(~year) +
theme_bw() + xlab("") + ylab("") +
theme(
strip.background =element_rect(fill="white"),
strip.text = element_text(size = 8, margin = margin()),
legend.position = "inside", legend.position.inside = c(0.7, 0.1),
legend.direction = "horizontal",
legend.key.width = unit(1, "cm"),
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
```

The rationale for using GAMs with GP splines for spatially varying coefficient (SVC) or temporally varying coefficient (TVC) models is as follows:

- GAMs with splines or smooths capture non-linear relationships between the response variable and covariates.
- splines generate a varying coefficient model when they are parameterised with more than one variable.
- this is readily extending to the temporal and / or spatial dimensions to generate SVCs, TVCs and STVCs.
- different splines are available, but GP splines reflect Tobler’s First Law of Geography (spatial autocorrelation, process spatial heterogeneity, etc).
- this can be extended to the temporal case on the assumption of temporal decay (similarity decreases over time).
- GAMs are robust, have a rich theoretical background and been subject to much development.

Initial research has demonstrated the formulation and application of a GAM with GP splines calibrated via observation location as a multiscale SVC model: the Geographical Gaussian Process GAM (GGP-GAM) (Comber, Harris, and Brunsdon 2024). The GGP-GAM was compared with the most popular SVC model, Geographically Weighted Regression (GWR) (Brunsdon, Fotheringham, and Charlton 1996) and shown to out-perform Multiscale GWR.

However, when handling space *and* time, simply plugging all
the space time data into specific GAM configuration is to make
potentially unreasonable assumptions about how space and time interact
in spatially and temporally varying coefficient (STVC) models. To
address this workshop has suggested that the full set of models is
investigated to identify the best model or the best set of models. Where
there is a clear winner, this can be applied. Where these is not, as in
the example used then the model coefficients can be combined using
Bayesian Model Averaging (in the second vignette).

Bivand, Roger, Jakub Nowosad, and Robin Lovelace. 2019. “spData:
Datasets for Spatial Analysis.” R package version 0.3. 0, URL
https://CRAN. R-project. org/package= spData.

Brunsdon, Chris, A Stewart Fotheringham, and Martin E Charlton. 1996.
“Geographically Weighted Regression: A Method for Exploring
Spatial Nonstationarity.” *Geographical Analysis* 28 (4):
281–98.

Comber, Alexis, Paul Harris, and Chris Brunsdon. 2024. “Multiscale
Spatially Varying Coefficient Modelling Using a Geographical Gaussian
Process GAM.” *International Journal of Geographical
Information Science* 38 (1): 27–47.

Comber, Alexis, Paul Harris, Naru Tsutsumida, Jennie Gray, and Chris
Brunsdon. submitted. “Where, When and How? Specifying Spatially
and Temporally Varying Coefficient Models Using Generalized Additive
Models with Gaussian Process Splines.” *International Journal
of Geographical Information Science*, submitted.

Croissant, Yves, Giovanni Millo, and Kevin Tappe. 2022. “Plm:
Linear Models for Panel Data.”

Wood, Simon N. 2020. “Inference and Computation with Generalized
Additive Models and Their Extensions.” *TEST* 29 (2):
307–39. https://doi.org/10.1007/s11749-020-00711-5.