[R-sig-ME] Optimism introduced by non-converging models in bootstrap validation of GLMM (lme4)

Ben Bolker bbolker at gmail.com
Sun May 12 19:59:30 CEST 2019


  Quick thought: "non-convergence" doesn't necessarily mean the fit is
actually bad -- the convergence warnings are often false positives -- and
in most (all?) cases you still get a working fitted model, i.e. one you
could compute the C-index from.  If you compare the distributions of
C-indices from converging and non-converging models, what do you see?
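
  Here's a rough, untested sketch of what I mean: keep every bootstrap fit,
record whether lme4 flagged convergence problems (they end up in the fit's
optinfo slot), and compare the two C-index distributions afterwards.  It
reuses MyModel, d, and d$dep.var from your code below.

library(lme4)
library(Hmisc)

n <- 1000
c.index   <- numeric(n)
converged <- logical(n)
set.seed(2019)
for (i in seq_len(n)) {
  bootsample <- d[sample(nrow(d), replace = TRUE), ]
  ## suppress the console warnings; the messages are still stored in the fit
  tmpmod <- suppressWarnings(update(MyModel, data = bootsample))
  converged[i] <- length(tmpmod@optinfo$conv$lme4$messages) == 0
  c.index[i] <- somers2(predict(tmpmod, newdata = d, re.form = NULL,
                                allow.new.levels = TRUE),
                        d$dep.var)["C"]
}
## compare the two distributions
tapply(c.index, converged, summary)
boxplot(c.index ~ converged, names = c("non-converged", "converged"),
        ylab = "C-index on original data")

(Hard errors, as opposed to warnings, would still abort the loop; see the
note at the bottom of this message.)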

  Also: is bootstrapping at the level of observations OK, or should you
be bootstrapping at the level of groups (or hierarchically, i.e. both
among and within groups)?
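
  For example, here's a quick (untested) sketch of a group-level bootstrap;
it assumes the grouping factor in d is called "group" (substitute your
actual variable name):

## resample whole groups with replacement and stack their rows; relabel
## each draw so repeated copies of a group count as distinct clusters
boot_groups <- function(d, group = "group") {
  ids <- unique(d[[group]])
  picked <- sample(ids, length(ids), replace = TRUE)
  do.call(rbind, lapply(seq_along(picked), function(j) {
    block <- d[d[[group]] == picked[j], , drop = FALSE]
    block[[group]] <- paste0(picked[j], ".copy", j)
    block
  }))
}
## then, inside the loop:  bootsample <- boot_groups(d, group = "group")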

  I'm also not 100% sure that the bootstrap-refitted models' discrimination
on the original data must be lower than the original fit's -- among other
things, the model is *not* maximizing discrimination, it's maximizing
likelihood (the two are presumably asymptotically equivalent but needn't
coincide in a finite sample ... ?)

On 2019-05-12 1:26 p.m., Juho Kristian Ruohonen wrote:
> Hi,
> 
> I have a random-intercept logistic GLMM fit to a dataset of 650
> observations. The model discriminates the binary outcomes of the dataset
> very nicely (C-index aka ROC AUC equals .85). But a reviewer suggested that
> in a dataset this small, the model's discrimination performance should be
> tested through a bootstrap procedure.
> 
> My script draws a bootstrap sample from the dataset, fits the model to it,
> then uses the somers2() function from the Hmisc library to calculate the
> C-index of the bootstrap-fitted model when used to discriminate the original
> data. This is repeated 1000 times:
> 
> # The original dataset is called d
> require(lme4)
> require(Hmisc)
> n <- 1000
> c.index <- numeric(length = n)
> set.seed(2019)
> for(i in 1:n){
>   bootsample <- d[sample(nrow(d), replace = TRUE), ]
>   tmpmod <- update(MyModel, ~ ., data = bootsample)
>   c.index[i] <- somers2(predict(tmpmod, newdata = d, re.form = NULL,
>                                 allow.new.levels = TRUE), d$dep.var)["C"]
> }
> 
> It turns out that at .854, average discrimination in 1000 bootstrap
> iterations is slightly higher than the original model's in-sample
> discrimination (.85). There must be an error, no? Surely the out-of-sample
> performance of any model should be worse, on average, than its in-sample
> performance?
> 
> My actual code includes additional, fairly messy bits which attempt to
> refit non-converging models using alternative optimizers (from the optimx
> package) and, failing that, draw a new bootstrap sample if the model
> cannot be fitted to the first one for that iteration. In total,
> there were 128 bootstrap samples for which the model failed to converge,
> necessitating an alternative bootstrap sample. My suspicion is that this is
> the reason behind the too-good-to-be-true results -- excluding those 128
> non-converging bootstrap samples has introduced an optimistic bias into
> the bootstrap sampling, such that favorable bootstrap samples are
> overrepresented and unfavorable ones
> underrepresented. The question is: since it is impossible to achieve
> convergence in every bootstrap sample, is there some heuristic for
> estimating the degree of optimism introduced by bootstrap samples that had
> to be discarded due to non-convergence?
> 
> Best,
> 
> J
> 
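
  One more thought, on the alternative-optimizer workaround you describe:
rather than discarding a bootstrap sample whenever the default fit
complains, you could keep any fit that returns at all and only redraw on a
hard error.  A rough, untested sketch (the helper name fit_boot() is mine,
not part of your code; it assumes MyModel as above):

fit_boot <- function(bootsample) {
  ## try the default optimizer first; on a hard error, retry with bobyqa
  fit <- tryCatch(update(MyModel, data = bootsample),
                  error = function(e) NULL)
  if (is.null(fit)) {
    fit <- tryCatch(update(MyModel, data = bootsample,
                           control = glmerControl(optimizer = "bobyqa")),
                    error = function(e) NULL)
  }
  fit  ## NULL only if both attempts error out; convergence warnings are kept
}

That way you lose only the (hopefully rare) samples where the model cannot
be fitted at all, and you can tabulate how many of the retained fits carried
convergence warnings rather than silently replacing them.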


