[R-sig-ME] Optimism introduced by non-converging models in bootstrap validation of GLMM (lme4)

Sun May 12 19:26:29 CEST 2019

Hi,

I have a random-intercept logistic GLMM fit to a dataset of 650
observations. The model discriminates the binary outcomes of the dataset
very nicely (C-index aka ROC AUC equals .85). But a reviewer suggested that
in a dataset this small, the model's discrimination performance should be
tested through a bootstrap procedure.

My script draws a bootstrap sample from the dataset, fits the model to it,
then uses the somers2() function from the Hmisc package to calculate the
C-index of the bootstrap-fit model when used to discriminate the original
data. This is repeated 1000 times:

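# The original dataset is called d; MyModel is the GLMM fit to d
library(lme4)
library(Hmisc)
n <- 1000
c.index <- numeric(length = n)
set.seed(2019)
for (i in seq_len(n)) {
  # Resample rows of d with replacement
  bootsample <- d[sample(nrow(d), replace = TRUE), ]
  # Refit the same model to the bootstrap sample
  tmpmod <- update(MyModel, ~ ., data = bootsample)
  # Predictions for the original data (link scale; the C-index depends
  # only on ranks, so the scale does not change its value)
  pred <- predict(tmpmod, newdata = d, re.form = NULL,
                  allow.new.levels = TRUE)
  c.index[i] <- somers2(pred, d$dep.var)["C"]
}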

It turns out that at .854, average discrimination in 1000 bootstrap
iterations is slightly higher than the original model's in-sample
discrimination (.85). There must be an error, no? Surely the out-of-sample
performance of any model should be worse, on average, than its in-sample
performance?
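
To make the comparison explicit, here is a minimal sketch of how the
in-sample and original-data C-index could be computed side by side for a
single bootstrap replicate. It is not part of my actual script: c_index()
and boot_optimism() are hypothetical helper names, and c_index() is a
base-R stand-in for somers2()["C"] based on the rank-sum identity.

```r
# Base-R C-index (ROC AUC) via the rank-sum identity; ties get mid-ranks.
# Assumes y is coded 0/1 (hypothetical stand-in for somers2()["C"]).
c_index <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# One bootstrap replicate (hypothetical helper):
#   apparent = C-index of the bootstrap fit on its own bootstrap sample
#   test     = C-index of the same fit on the original data
#   optimism = apparent - test
boot_optimism <- function(fit_fn, pred_fn, d, y_name) {
  b    <- d[sample(nrow(d), replace = TRUE), , drop = FALSE]
  fit  <- fit_fn(b)
  app  <- c_index(pred_fn(fit, b), b[[y_name]])
  test <- c_index(pred_fn(fit, d), d[[y_name]])
  c(apparent = app, test = test, optimism = app - test)
}

# With the objects from my script, the two functions might look like this
# (untested sketch):
# fit_fn  <- function(b) update(MyModel, ~ ., data = b)
# pred_fn <- function(fit, nd) predict(fit, newdata = nd, re.form = NULL,
#                                      allow.new.levels = TRUE)
# res <- replicate(1000, boot_optimism(fit_fn, pred_fn, d, "dep.var"))
# rowMeans(res)
```

Comparing the mean of "test" with the mean of "apparent", rather than with
the original model's .85, would at least show whether each bootstrap fit
does worse on the original data than on the sample it was fit to.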

My actual code includes additional, fairly messy bits which attempt to
refit non-converging models using alternative optimizers (from the optimx
package) and, failing that, draw a new bootstrap sample if the model
cannot be fit to the first bootstrap sample for that iteration. In total,
there were 128 bootstrap samples for which the model failed to converge,
necessitating an alternative bootstrap sample. My suspicion is that this is
the reason behind the too-good-to-be-true results -- the exclusion of those
128 bootstrap samples for which the model failed to converge has introduced
an optimistic bias into the bootstrap sampling, such that favorable
bootstrap samples are overrepresented and unfavorable ones
underrepresented. The question is: since it is impossible to achieve
convergence in every bootstrap sample, is there some heuristic for
estimating the degree of optimism introduced by bootstrap samples that had
to be discarded due to non-convergence?
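
For reference, the refit-then-redraw logic described above works roughly
like the sketch below. This is a simplified illustration, not my actual
code: try_fits() and boot_with_log() are hypothetical names, and a fit is
treated as failed whenever it raises any warning or error, which is
stricter than checking for convergence warnings only. The sketch also logs
how many samples were discarded in each iteration, so the discarded samples
can at least be counted and inspected afterwards.

```r
# Try each fitting function in turn (hypothetical helper). A fit counts as
# failed if it raises ANY warning or error; this is a deliberately strict
# assumption, since not every warning signals non-convergence.
try_fits <- function(fitters, data) {
  for (k in seq_along(fitters)) {
    fit <- tryCatch(fitters[[k]](data),
                    warning = function(w) NULL,
                    error   = function(e) NULL)
    if (!is.null(fit)) return(list(fit = fit, attempt = k))
  }
  NULL  # every fitter failed on this sample
}

# Bootstrap loop that redraws the sample when all fitters fail and records
# how many samples were discarded per iteration (hypothetical helper).
boot_with_log <- function(d, fitters, score, n_boot, max_redraw = 50) {
  stat      <- numeric(n_boot)
  attempt   <- integer(n_boot)  # index of the fitter that succeeded
  discarded <- integer(n_boot)  # samples thrown away in this iteration
  for (i in seq_len(n_boot)) {
    repeat {
      b   <- d[sample(nrow(d), replace = TRUE), , drop = FALSE]
      res <- try_fits(fitters, b)
      if (!is.null(res)) break
      discarded[i] <- discarded[i] + 1L
      if (discarded[i] >= max_redraw) stop("too many failed redraws")
    }
    stat[i]    <- score(res$fit, d)
    attempt[i] <- res$attempt
  }
  list(stat = stat, attempt = attempt, discarded = discarded)
}

# With lme4 the fitters might be defined like this (untested sketch; the
# second one switches the optimizer via glmerControl):
# fitters <- list(
#   function(b) update(MyModel, ~ ., data = b),
#   function(b) update(MyModel, ~ ., data = b,
#                      control = glmerControl(optimizer = "bobyqa"))
# )
# score <- function(fit, d) somers2(predict(fit, newdata = d,
#            re.form = NULL, allow.new.levels = TRUE), d$dep.var)[["C"]]
```

Here sum(out$discarded) would give the number of rejected samples (128 in
my run), and out$attempt shows which optimizer eventually produced each
accepted fit.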

Best,

J
