[R-sig-ME] Significance of B-splines components in mixed-effects logistic regression (glmer)

Fri Sep 21 17:39:15 CEST 2018

  fortunes::fortune("should be done")

  The ANOVA comparisons should be all you need.  car::Anova() or drop1()
or afex::mixed() are all convenience functions for that.  Since
parameters for splines are harder to interpret, you could just leave out
that part of the parameter table ...

  The freakonometrics post you cite concludes:

>  So, it looks like having a lot of non significant components in a
spline regression is not a major issue. And reducing the degrees of
freedom is clearly a bad option.

Furthermore, stepwise throwing-away of terms is a recipe for messing up
your inference (snooping/garden of forking paths).

Your modeling approach looks fine; you *could* use gamm4 to get
penalized regression splines, but again, it's better from an
inferential/non-snooping point of view to pick a sensible model and
stick with it unless it's obviously problematic.

  On a technical level, it's not clear whether the "discrepancy" (not
really) between the summary() results and the anova() results is due to
(1) the combined effect of a term with several components being
significant even when the individual components are not; (2) the
difference between Wald tests (used in summary) and likelihood-based
tests (used in anova()).  This could be disentangled, but IMO it's only
worth it from a pedagogical/exploratory perspective.

  Ben Bolker

On 2018-09-21 10:54 AM, Anne Lerche wrote:
> Good afternoon,
> I have a problem with reporting significance of b-splines components in
> a mixed-effects logistic regression fit in lme4 (caused by a reviewer's
> comment on a paper). After several hours of searching the literature,
> forums and the internet more generally, I have not found a solution and
> therefore turn to the recipients of this mailing list for help. (My
> questions are at the very end of the mail)
> 
> I am trying to model the change in the use of linguistic variable on the
> basis of corpus data. My dataset contains the binary dependent variable
> (DV, variant "a" or "b" being used), 2 random variables (RV1 and RV2,
> both categorical) and three predictors (IV1=time, IV2=another numeric
> variable, IV3=a categorical variable with 7 levels).
> 
> I wasn't sure if I should attach my (modified) dataset, so I'm trying to
> produce an example. Unfortunately, it doesn't give the same results as
> my original dataset.
> 
> library(lme4)
> library(splines)
> library(languageR)
> 
> df <- dative[dative$Modality == "spoken",]
> df <- df[,c("RealizationOfRecipient", "Verb", "Speaker",
> "LengthOfTheme", "SemanticClass")]
> colnames(df) <- c("DV", "RV1", "RV2", "IV2", "IV3")
> set.seed(130)
> df$IV1 <- sample(1:13, 2360, replace = TRUE)
> 
> My final regression model looks like this (treatment contrast coding):
> fin.mod <- glmer(DV~bs(IV1, knots=c(5,9),
> degree=1)+IV2+IV3+(1|RV1)+(1|RV2),
>                  data=df, family=binomial)
> summary(fin.mod, corr=FALSE)
> 
> where the effect of IV1 is modelled as a b-spline with 2 knots and a
> degree of 1. Anova comparisons (of the original dataset) show that this
> model performs significantly better than a) a model without IV1 modelled
> as a b-spline (bs(IV1, knots=c(5,9), degree=1)), b) a model with IV1 as
> a linear predictor (not using bs), c) a model with the df of the spline
> specified instead of the knots (df=3), so that bs chooses knots
> autonomously, and d) a model with only 2 df (bs(IV1, df=2, degree=1)). I
> also ran comparisons with models with quadratic or cubis splines, and
> still my final model performs significantly better.
> 
> The problem is that I am reporting this final model in a paper, and one
> of the reviewers comments that I am reporting a non-significant effect
> of IV1 because according to the coefficients table the variable does not
> seem to have a significant effect (outlier correction does not make a
> big difference):
> 
> Fixed effects:
>                                       Estimate Std. Error z value Pr(>|z|)
> (Intercept)                            0.52473    0.50759   1.034    0.301
> bs(IV1, knots = c(5, 9), degree = 1)1 -0.93178    0.59162  -1.575    0.115
> bs(IV1, knots = c(5, 9), degree = 1)2  0.69287    0.43018   1.611    0.107
> bs(IV1, knots = c(5, 9), degree = 1)3 -0.19389    0.61144  -0.317    0.751
> IV2                                    0.47041    0.11615   4.050
> 5.12e-05 ***
> IV3level2                              0.30149    0.53837   0.560    0.575
> IV3level3                              0.15682    0.48760   0.322    0.748
> IV3level4                             -0.89664    0.18656  -4.806
> 1.54e-06 ***
> IV3level5                             -2.90305    0.68119  -4.262
> 2.03e-05 ***
> IV3level6                             -0.32081    0.29438  -1.090    0.276
> IV3level7                             -0.07038    0.87727  -0.080    0.936
> (coefficients table of the sample dataset will differ)
> 
> I know that the results of anova comparisons and what the coefficients
> table shows are two different things (as in the case of IV3 which also
> significantly improves model quality when added to the regression even
> if only few levels show significant contrasts).
> 
> My questions are:
> How can I justify reporting my regression model when the regression
> table shows only non-significant components for the b-spline term? (Is
> it enough to point to the anova comparisons?)
> Is is possible to keep only some components of the b-spline (as
> suggested here for linear regression:
> https://freakonometrics.hypotheses.org/47681)?
> Is there a better way of modeling the data? I am not very familiar with
> gamm4 or nlme, for example.
> 
> Any help is very much appreciated!
> Thank you,
> Anne
> 
>