[R] mgcv: estimate concurvity vs worst concurvity in GAMs

Sat Jul 2 20:20:49 CEST 2022

Dear list members,

I was wondering if someone could explain (in conceptual terms) how to
interpret *estimate* concurvity in a GAM implemented with mgcv and how
it differs from *worst* concurvity (as obtained through mgcv's
concurvity function). I understand that concurvity is the
non-parametric analogue of collinearity in GAMs and that it represents
the extent to which a smooth term can be approximated by one or more
of the other smooth terms in the model.

It seems to be common practice to base one's course of action on the
worst concurvity estimate (as e.g. advised in Noam Ross' course on
GAMs). However, the mgcv help page for concurvity states that worst
concurvity is a "fairly pessimistic measure, as it looks at the worst
case irrespective of data", whereas estimate concurvity "does not
suffer from the pessimism or potential for over-optimism of the
previous two measures, but is less easy to understand".
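
For anyone who wants to reproduce the kind of output I am asking about, here is a minimal sketch on simulated data (not my mortality data). The variables x1, x2 and y and all object names are made up for illustration; the only mgcv calls used are gam() and concurvity().

```r
library(mgcv)

set.seed(1)
n  <- 400
x1 <- runif(n)
x2 <- x1^2 + rnorm(n, sd = 0.05)  # x2 is close to a smooth function of x1
y  <- sin(2 * pi * x1) + x2 + rnorm(n, sd = 0.3)

# Toy model with two smooths that are expected to be concurved
fit <- gam(y ~ s(x1) + s(x2), method = "REML")

# full = TRUE: each term against all other terms combined
# (a matrix with rows "worst", "observed" and "estimate")
overall <- concurvity(fit, full = TRUE)

# full = FALSE: pairwise term-by-term matrices
# (a list with elements worst, observed and estimate)
pairwise <- concurvity(fit, full = FALSE)

print(round(overall, 3))
print(round(pairwise$estimate, 3))
```

Comparing the three rows of `overall` on a toy example like this, where the dependence between covariates is known by construction, may help in judging how far apart the measures can be.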

Worst concurvity is extremely high in my GAMs, whereas estimate
concurvity is much lower (see below), so I am unsure whether I need
to address the concurvity at all. I should stress that the aim of my
model is to gain an understanding of the relationship between
variables, rather than to maximise predictive performance.

For those interested, the concurvity values are for a GAM with the
number of daily deaths as the response variable and three predictors:
a smooth of time, a smooth of a heat variable (wbgt_mean) and a smooth
of precipitation. wbgt_mean is the variable of interest and
precipitation is a potential confounder. Heat and precipitation are
modelled with a distributed lag of up to 6 days, set up as 7-column
matrices as per Simon Wood's book on GAMs (2017, p. 352). The model is
as follows:
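
To make the matrix setup concrete, here is a rough sketch of how such 7-column lag matrices can be built in base R. The helper lag_matrix() and the toy data frame are hypothetical names and numbers introduced only for illustration; this is not the code from the book, just one way to get a matrix whose columns hold lags 0 to 6.

```r
# Hypothetical helper: column j holds x lagged by j - 1 days (lags 0..max_lag).
# Rows without enough history are left as NA. Assumes length(x) > max_lag.
lag_matrix <- function(x, max_lag = 6) {
  n <- length(x)
  out <- matrix(NA_real_, nrow = n, ncol = max_lag + 1)
  for (l in 0:max_lag) {
    out[(l + 1):n, l + 1] <- x[1:(n - l)]
  }
  out
}

# Toy daily series standing in for the real heat variable
toy <- data.frame(wbgt = c(20, 21, 23, 22, 25, 27, 26, 24, 23, 22))

# Matrix covariate: one row per day, one column per lag
toy$wbgt_mean <- lag_matrix(toy$wbgt, max_lag = 6)

# Matching matrix of lag indices, identical in every row
toy$lag <- matrix(0:6, nrow = nrow(toy), ncol = 7, byrow = TRUE)

print(toy$wbgt_mean[7, ])
```

With both matrices stored as columns of the data frame, a term such as te(wbgt_mean, lag) receives two matrix arguments of the same dimension, which is the form used in the model below.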

c1b <- gam(deaths_ip ~ s(time, k = 200) +
             te(wbgt_mean, lag, k = c(12, 4)) +
             te(precip_daily_total, lag, k = c(12, 4)),
           data = dat, family = nb, method = 'REML', select = TRUE)

Let's ignore the issue of time decomposition for the moment, since
none of the many ways I've tried reduced concurvity much.

Depending on whether I have to deal with concurvity or not, my models
and their interpretation will look very different. As a potential
solution for high concurvity, I developed alternative models using a
detrended measure of heat as a predictor (the residuals from a GAM
with heat as response and time as a predictor), and did the same for
precipitation. This does reduce concurvity substantially, but it
severely limits the practical applicability and interpretability of
the results, so I'd rather not take this route if I can avoid it.
(Modelling with an autoregressive term, as helpfully suggested in
response to a previous post, did not reduce concurvity either.)
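
For clarity, the detrending step I describe amounts to something like the following sketch. The data are simulated and the names toy, heat, trend_fit and heat_detrended, as well as the choice k = 20, are assumptions made for illustration rather than my actual settings.

```r
library(mgcv)

set.seed(42)
n    <- 730
time <- seq_len(n)

# Simulated heat series with a seasonal trend plus noise
heat <- 20 + 8 * sin(2 * pi * time / 365) + rnorm(n, sd = 2)
toy  <- data.frame(time = time, heat = heat)

# Auxiliary GAM: heat as response, smooth of time as the only predictor
trend_fit <- gam(heat ~ s(time, k = 20), data = toy, method = "REML")

# Detrended heat: what is left after removing the fitted time trend
toy$heat_detrended <- residuals(trend_fit, type = "response")

print(c(sd_raw = sd(toy$heat), sd_detrended = sd(toy$heat_detrended)))
```

The detrended variable then replaces the raw one in the mortality model, which is why the resulting effect refers to deviations from the time trend rather than to heat itself.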

Below is the output from the concurvity function with argument
full=TRUE. Explanations of estimate vs worst concurvity will be very
gratefully received!

             para    s(time) te(wbgt_mean,lag) te(precip_daily_total,lag)
worst    0.957257 0.96533049         0.9811214                  0.9749704
observed 0.957257 0.03825656         0.7652984                  0.8568042
estimate 0.957257 0.04334243         0.4197013                  0.5975567

And with argument full= FALSE:

$worst
                                   para      s(time) te(wbgt_mean,lag) te(precip_daily_total,lag)
para                       1.000000e+00 6.033833e-17        0.04891485                  0.6677235
s(time)                    6.033871e-17 1.000000e+00        0.96109743                  0.7784443
te(wbgt_mean,lag)          4.006085e-02 9.521445e-01        1.00000000                  0.6941748
te(precip_daily_total,lag) 6.677235e-01 7.784443e-01        0.70026895                  1.0000000

$observed
                                   para      s(time) te(wbgt_mean,lag) te(precip_daily_total,lag)
para                       1.000000e+00 2.203611e-27      5.898608e-33                  0.5790756
s(time)                    6.033871e-17 1.000000e+00      7.485690e-01                  0.2652631
te(wbgt_mean,lag)          4.006085e-02 1.928856e-02      1.000000e+00                  0.1902533
te(precip_daily_total,lag) 6.677235e-01 1.914191e-02      3.076595e-01                  1.0000000

$estimate
                                   para      s(time) te(wbgt_mean,lag) te(precip_daily_total,lag)
para                       1.000000e+00 4.023819e-23      6.199709e-33                  0.2582767
s(time)                    6.033871e-17 1.000000e+00      4.031365e-01                  0.3198069
te(wbgt_mean,lag)          4.006085e-02 2.519282e-02      1.000000e+00                  0.2498455
te(precip_daily_total,lag) 6.677235e-01 1.644668e-02      1.119781e-01                  1.0000000