# [R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?

Michael Dewey ||@t@ @end|ng |rom dewey@myzen@co@uk
Thu Jun 9 15:52:35 CEST 2022

```Dear Jade

Do you really need to fit a separate parameter for each heaping day? Can
you not just make it a binary predictor or a categorical one with fewer
levels, perhaps 14 (for heaping in each year) or 12 (for each calendar
month). I have no idea whether that would help but it seems worth a try.

Michael

On 08/06/2022 18:15, jade.shodan--- via R-help wrote:
> Hi Simon,
>
> Thanks so much for this!! I have two follow up questions, if you don't mind.
>
> 1. Does including an autoregressive term not adjust away part of the
> effect of the response in a distributed lag model (where the outcome
> accumulates over time)?
> 2. I've tried to fit a model using bam (just a first attempt without
> AR term), but including the factor variable heap creates errors:
>
> bam0 <- bam(deaths~te(year, month, week, weekday,
> bs=c("cr","cc","cc","cc"), k = c(4,5,5,5)) + heap +
>                        te(temp_max, lag, k=c(8, 3)) +
>                        te(precip_daily_total, lag, k=c(8, 3)),
>                        data = dat, family = nb, method = 'fREML',
> select = TRUE, discrete = TRUE,
>                        knots = list(month = c(0.5, 12.5), week = c(0.5,
> 52.5), weekday = c(0, 6.5)))
>
> This model results in errors:
>
> Warning in estimate.theta(theta, family, y, mu, scale = scale1, wt = G\$w,  :
>    step failure in theta estimation
> Warning in sqrt(family\$dev.resids(object\$y, object\$fitted.values,
> object\$prior.weights)) :
>    NaNs produced
>
>
> Including heap as as.numeric(heap) runs the model without error
> messages or warnings, but model diagnostics look terrible, and it also
> doesn't make sense (to me) to make heap a numeric. The factor variable
> heap (with 169 levels) codes the fact that all deaths for which no
> date was known, were registered on the 15th day of each month. I've
> coded all non-heaping days as 0. All heaping days were coded as a
> value between 1-168. The time series spans 14 years, so a heaping day
> in each month results in 14*12 levels = 168, plus one level for
> non-heaping days.
>
> So my second question is: Does bam allow factor variables? And if not,
> how should I model this heaping on the 15th day of the month instead?
>
> With thanks,
>
>
> On Wed, 8 Jun 2022 at 12:05, Simon Wood <simon.wood using bath.edu> wrote:
>>
>> I would not worry too much about high concurvity between variables like
>> temperature and time. This just reflects the fact that temperature has a
>> strong temporal pattern.
>>
>> I would also not be too worried about the low p-values on the k check.
>> The check only looks for pattern in the residuals when they are ordered
>> with respect to the variables of a smooth. When you have time series
>> data and some smooths involve time then it's hard not to pick up some
>> degree of residual auto-correlation, which you often would not want to
>> model with a higher rank smoother.
>>
>> The NAs  for the distributed lag terms just reflect the fact that there
>> is no obvious way to order the residuals w.r.t. the covariates for such
>> terms, so the simple check for residual pattern is not really possible.
>>
>> One simple approach is to fit using bam(...,discrete=TRUE) which will
>> let you specify an AR1 parameter to mop up some of the residual
>> auto-correlation without resorting to a high rank smooth that then does
>> all the work of the covariates as well. The AR1 parameter can be set by
>> looking at the ACF of the residuals of the model without this. You need
>> to look at the ACF of suitably standardized residuals to check how well
>> this has worked.
>>
>> best,
>>
>> Simon
>>
>> On 05/06/2022 20:01, jade.shodan--- via R-help wrote:
>>> Hello everyone,
>>>
>>> A few days ago I asked a question about concurvity in a GAM (the
>>> anologue of collinearity in a GLM) implemented in mgcv. I think my
>>> question was a bit unfocussed, so I am retrying again, but with
>>> output, it might make for easier
>>>
>>> As mentioned previously, I have problems with concurvity in my thesis
>>> research, and don't have access to a statistician who works with time
>>> series, GAMs or R. I'd be very grateful for any (partial) answer,
>>> however short. I'll gladly return the favour where I can! For really
>>> helpful input I'd be more than happy to offer co-authorship on
>>> no results at all if I can't solve this concurvity issue :(
>>>
>>> I'm using GAMs to try to understand the relationship between deaths
>>> and heat-related variables (e.g. temperature and humidity), using
>>> daily time series over a 14-year period from a tropical, low-income
>>> country. My aim is to understand the relationship between these
>>> variables and deaths, rather than pure prediction performance.
>>>
>>> The GAMs include distributed lag models (set up as 7-column matrices,
>>> see code at bottom of post), since deaths may occur over several days
>>> following exposure.
>>>
>>> Simple GAMs with just time, lagged temperature and lagged
>>> precipitation (a potential confounder) show very high concurvity
>>> between lagged temperature and time, regardless of the many different
>>> ways I have tried to decompose time. The autocorrelation functions
>>> (ACF) however, shows values close to zero, only just about breaching
>>> the 'significance line' in a few instances. It does show patterning
>>> though, although the regularity is difficult to define.
>>>
>>> My questions are:
>>> 1) Should I be worried about the high concurvity, or can I ignore it
>>> given the mostly non-significant ACF? I've read dozens of
>>> heat-mortality modelling studies and none report on concurvity between
>>> weather variables and time (though one 2012 paper discussed
>>> autocorrelation).
>>>
>>> 2) If I cannot ignore it, what should I do to resolve it? Would
>>> including an autoregressive term be appropriate, and if so, where can
>>> I find a coded example of how to do this? I've also come across
>>> sequential regression][1]. Would this be more or less appropriate? If
>>> appropriate, a pointer to an example would be really appreciated!
>>>
>>> Some example GAMs are specified as follows:
>>> ```r
>>> conc38b <- gam(deaths~te(year, month, week, weekday,
>>> bs=c("cr","cc","cc","cc")) + heap +
>>>                         te(temp_max, lag, k=c(10, 3)) +
>>>                         te(precip_daily_total, lag, k=c(10, 3)),
>>>                         data = dat, family = nb, method = 'REML', select = TRUE,
>>>                         knots = list(month = c(0.5, 12.5), week = c(0.5,
>>> 52.5), weekday = c(0, 6.5)))
>>> ```
>>> Concurvity for the above model between (temp_max, lag) and (year,
>>> month, week, weekday) is 0.91:
>>>
>>> ```r
>>> \$worst
>>>                                       para te(year,month,week,weekday)
>>> te(temp_max,lag) te(precip_daily_total,lag)
>>> para                        1.000000e+00                1.125625e-29
>>>        0.3150073                  0.6666348
>>> te(year,month,week,weekday) 1.400648e-29                1.000000e+00
>>>        0.9060552                  0.6652313
>>> te(temp_max,lag)            3.152795e-01                8.998113e-01
>>>        1.0000000                  0.5781015
>>> te(precip_daily_total,lag)  6.666348e-01                6.652313e-01
>>>        0.5805159                  1.0000000
>>> ```
>>>
>>> Output from ```gam.check()```:
>>> ```r
>>> Method: REML   Optimizer: outer newton
>>> full convergence after 16 iterations.
>>> (score 8915.994 & scale 1).
>>> Hessian positive definite, eigenvalue range [5.048053e-05,26.50175].
>>> Model rank =  544 / 544
>>>
>>> Basis dimension (k) checking results. Low p-value (k-index<1) may
>>> indicate that k is too low, especially if edf is close to k'.
>>>
>>>                                     k'      edf k-index p-value
>>> te(year,month,week,weekday) 319.0000  26.6531    0.96    0.06 .
>>> te(temp_max,lag)             29.0000   3.3681      NA      NA
>>> te(precip_daily_total,lag)   27.0000   0.0051      NA      NA
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>> ```
>>>
>>> Some output from ```summary(conc38b)```:
>>> ```r
>>> Approximate significance of smooth terms:
>>>                                     edf Ref.df  Chi.sq p-value
>>> te(year,month,week,weekday) 26.653127    319 166.803 < 2e-16 ***
>>> te(temp_max,lag)             3.368129     27  11.130 0.00145 **
>>> te(precip_daily_total,lag)   0.005104     27   0.002 0.69636
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> R-sq.(adj) =  0.839   Deviance explained = 53.3%
>>> -REML =   8916  Scale est. = 1         n = 5107
>>> ```
>>>
>>>
>>> Below are the ACF plots (note limit y-axis = 0.1 for clarity of
>>> pattern). They show peaks at 5 and 15, and then there seems to be a
>>> recurring pattern at multiples of approx. 30 (suggesting month is not
>>> modelled adequately?). Not sure what would cause the spikes at 5 and
>>> 15. There is heaping of deaths on the 15th day of each month, to which
>>> deaths with unknown date were allocated. This heaping was modelled
>>> with categorical variable/ factor ```heap``` with 169 levels (0 for
>>> all non-heaping days and 1-168 (i.e. 14 * 12 for each subsequent
>>> heaping day over the 14-year period):
>>>
>>>     [2]: https://i.stack.imgur.com/FzKyM.png
>>>     [3]: https://i.stack.imgur.com/fE3aL.png
>>>
>>>
>>> I get an identical looking ACF when I decompose time into (year,
>>> month, monthday) as in model conc39 below, although concurvity between
>>> (temp_max, lag) and the time term has now dropped somewhat to 0.83:
>>>
>>> ```r
>>> conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap +
>>>                        te(temp_max, lag, k=c(10, 4)) +
>>>                        te(precip_daily_total, lag, k=c(10, 4)),
>>>                        data = dat, family = nb, method = 'REML', select = TRUE,
>>>                        knots = list(month = c(0.5, 12.5)))
>>> ```
>>> ```r
>>>
>>> Method: REML   Optimizer: outer newton
>>> full convergence after 14 iterations.
>>> (score 8915.763 & scale 1).
>>> Hessian positive definite, eigenvalue range [1.894391e-05,26.99215].
>>> Model rank =  323 / 323
>>>
>>> Basis dimension (k) checking results. Low p-value (k-index<1) may
>>> indicate that k is too low, especially if edf is close to k'.
>>>
>>>                                   k'     edf k-index p-value
>>> te(year,month,monthday)    79.0000 25.0437    0.93  <2e-16 ***
>>> te(temp_max,lag)           39.0000  4.0875      NA      NA
>>> te(precip_daily_total,lag) 36.0000  0.0107      NA      NA
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>> ```
>>> Some output from ```summary(conc39)```:
>>> ```r
>>> Approximate significance of smooth terms:
>>>                                   edf Ref.df  Chi.sq  p-value
>>> te(year,month,monthday)    38.75573     99 187.231  < 2e-16 ***
>>> te(temp_max,lag)            4.06437     37  25.927 1.66e-06 ***
>>> te(precip_daily_total,lag)  0.01173     36   0.008    0.557
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> R-sq.(adj) =  0.839   Deviance explained = 53.8%
>>> -REML =   8915  Scale est. = 1         n = 5107
>>> ```
>>>
>>>
>>> ```r
>>> \$worst
>>>                                      para te(year,month,monthday)
>>> te(temp_max,lag) te(precip_daily_total,lag)
>>> para                       1.000000e+00            3.261007e-31
>>> 0.3313549                  0.6666532
>>> te(year,month,monthday)    3.060763e-31            1.000000e+00
>>> 0.8266086                  0.5670777
>>> te(temp_max,lag)           3.331014e-01            8.225942e-01
>>> 1.0000000                  0.5840875
>>> te(precip_daily_total,lag) 6.666532e-01            5.670777e-01
>>> 0.5939380                  1.0000000
>>> ```
>>>
>>> Modelling time as ```te(year, doy)``` with a cyclic spline for doy and
>>> various choices for k or as ```s(time)``` with various k does not
>>> reduce concurvity either.
>>>
>>>
>>> The default approach in time series studies of heat-mortality is to
>>> model time with fixed df, generally between 7-10 df per year of data.
>>> profiles vary with locality due to sociodemographic and environmental
>>> characteristics and b) the choice of df is based on higher income
>>> countries (where nearly all these studies have been done) with
>>> different mortality profiles and so may not be appropriate for
>>> tropical, low-income countries.
>>>
>>> Although the approach of fixing (high) df does remove more temporal
>>> patterns from the ACF (see model and output below), concurvity between
>>> time and lagged temperature has now risen to 0.99! Moreover,
>>> temperature (which has been a consistent, highly significant predictor
>>> in every model of the tens (hundreds?) I have run, has now turned
>>> non-significant. I am guessing this is because time is now a very
>>> wiggly function that not only models/ removes seasonal variation, but
>>> also some of the day-to-day variation that is needed for the
>>> temperature smooth  :
>>>
>>> ```r
>>> conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap +
>>>                         te(temp_max, lag, k=c(10,3)) +
>>>                         te(precip_daily_total, lag, k=c(10,3)),
>>>                         data = dat, family = nb, method = 'REML', select = TRUE)
>>> ```
>>> Output from ```gam.check(conc20a, rep = 1000)```:
>>>
>>> ```r
>>> Method: REML   Optimizer: outer newton
>>> full convergence after 9 iterations.
>>> (score 8750.13 & scale 1).
>>> Hessian positive definite, eigenvalue range [0.0001420112,15.40832].
>>> Model rank =  336 / 336
>>>
>>> Basis dimension (k) checking results. Low p-value (k-index<1) may
>>> indicate that k is too low, especially if edf is close to k'.
>>>
>>>                                    k'      edf k-index p-value
>>> s(time)                    111.0000 111.0000    0.98    0.56
>>> te(temp_max,lag)            29.0000   0.6548      NA      NA
>>> te(precip_daily_total,lag)  27.0000   0.0046      NA      NA
>>> ```
>>> Output from ```concurvity(conc20a, full=FALSE)\$worst```:
>>>
>>> ```r
>>>                                      para      s(time) te(temp_max,lag)
>>> te(precip_daily_total,lag)
>>> para                       1.000000e+00 2.462064e-19        0.3165236
>>>                   0.6666348
>>> s(time)                    2.462398e-19 1.000000e+00        0.9930674
>>>                   0.6879284
>>> te(temp_max,lag)           3.170844e-01 9.356384e-01        1.0000000
>>>                   0.5788711
>>> te(precip_daily_total,lag) 6.666348e-01 6.879284e-01        0.5788381
>>>                   1.0000000
>>>
>>> ```
>>>
>>> Some output from ```summary(conc20a)```:
>>> ```r
>>> Approximate significance of smooth terms:
>>>                                    edf Ref.df  Chi.sq p-value
>>> s(time)                    1.110e+02    111 419.375  <2e-16 ***
>>> te(temp_max,lag)           6.548e-01     27   0.895   0.249
>>> te(precip_daily_total,lag) 4.598e-03     27   0.002   0.868
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> R-sq.(adj) =  0.843   Deviance explained = 56.1%
>>> -REML = 8750.1  Scale est. = 1         n = 5107
>>> ```
>>>
>>> ACF functions:
>>>
>>> [4]: https://i.stack.imgur.com/7nbXS.png
>>> [5]: https://i.stack.imgur.com/pNnZU.png
>>>
>>> Data can be found on my [GitHub][6] site in the file
>>> [data_cross_validated_post2.rds][7]. A csv version is also available.
>>> This is my code:
>>>
>>> ```r
>>> library(mgcv)
>>>
>>>
>>> # Create matrices for lagged weather variables (6 day lags) based on
>>> example by Simon Wood
>>> # in his 2017 book ("Generalized additive models: an introduction with
>>> R", p. 349) and
>>> # gamair package documentation
>>> (https://cran.r-project.org/web/packages/gamair/gamair.pdf, p. 54)
>>>
>>> lagard <- function(x,n.lag=7) {
>>> n <- length(x); X <- matrix(NA,n,n.lag)
>>> for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1]
>>> X
>>> }
>>>
>>> dat <- list(lag=matrix(0:6,nrow(df),7,byrow=TRUE),
>>> deaths=df\$deaths_total,doy=df\$doy, year = df\$year, month = df\$month,
>>> weekday = df\$weekday, week = df\$week, monthday = df\$monthday, time =
>>> df\$time, heap=df\$heap, heap_bin = df\$heap_bin, precip_hourly_dailysum
>>> = df\$precip_hourly_dailysum)
>>> dat\$temp_max <- lagard(df\$temp_max)
>>> dat\$temp_min <- lagard(df\$temp_min)
>>> dat\$temp_mean <- lagard(df\$temp_mean)
>>> dat\$wbgt_max <- lagard(df\$wbgt_max)
>>> dat\$wbgt_mean <- lagard(df\$wbgt_mean)
>>> dat\$wbgt_min <- lagard(df\$wbgt_min)
>>> dat\$temp_wb_nasa_max <- lagard(df\$temp_wb_nasa_max)
>>> dat\$sh_mean <- lagard(df\$sh_mean)
>>> dat\$solar_mean <- lagard(df\$solar_mean)
>>> dat\$wind2m_mean <- lagard(df\$wind2m_mean)
>>> dat\$sh_max <- lagard(df\$sh_max)
>>> dat\$solar_max <- lagard(df\$solar_max)
>>> dat\$wind2m_max <- lagard(df\$wind2m_max)
>>> dat\$temp_wb_nasa_mean <- lagard(df\$temp_wb_nasa_mean)
>>> dat\$precip_hourly_dailysum <- lagard(df\$precip_hourly_dailysum)
>>> dat\$precip_hourly <- lagard(df\$precip_hourly)
>>> dat\$precip_daily_total <- lagard( df\$precip_daily_total)
>>> dat\$temp <- lagard(df\$temp)
>>> dat\$sh <- lagard(df\$sh)
>>> dat\$rh <- lagard(df\$rh)
>>> dat\$solar <- lagard(df\$solar)
>>> dat\$wind2m <- lagard(df\$wind2m)
>>>
>>>
>>> conc38b <- gam(deaths~te(year, month, week, weekday,
>>> bs=c("cr","cc","cc","cc")) + heap +
>>>                         te(temp_max, lag, k=c(10, 3)) +
>>>                         te(precip_daily_total, lag, k=c(10, 3)),
>>>                         data = dat, family = nb, method = 'REML', select = TRUE,
>>>                         knots = list(month = c(0.5, 12.5), week = c(0.5,
>>> 52.5), weekday = c(0, 6.5)))
>>>
>>> conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap +
>>>                        te(temp_max, lag, k=c(10, 4)) +
>>>                        te(precip_daily_total, lag, k=c(10, 4)),
>>>                        data = dat, family = nb, method = 'REML', select = TRUE,
>>>                        knots = list(month = c(0.5, 12.5)))
>>>
>>> conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap +
>>>                         te(temp_max, lag, k=c(10,3)) +
>>>                         te(precip_daily_total, lag, k=c(10,3)),
>>>                         data = dat, family = nb, method = 'REML', select = TRUE)
>>>
>>> ```
>>> Thank you if you've read this far!! :-))
>>>
>>>     [2]: https://i.stack.imgur.com/FzKyM.png
>>>     [3]: https://i.stack.imgur.com/fE3aL.png
>>>     [4]: https://i.stack.imgur.com/7nbXS.png
>>>     [5]: https://i.stack.imgur.com/pNnZU.png
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> Simon Wood, School of Mathematics, University of Edinburgh,
>> https://www.maths.ed.ac.uk/~swood34/
>>
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help