[R-sig-ME] Weird behaviour of prediction intervals for sums of linear functions of predicted values

Fri Mar 29 14:31:57 CET 2019

Dear all,

I face a rather strange problem in lme4 where I would very much appreciate
the incredible expertise of the people here. Thanks in advance!

A stated, my case is rather peculiar and the motivation to what I’m doing
hard to explain, but it relates to the concept of Multilevel Regression
with Poststratification (MRP). A short explanation follows but I guess you
can skip the next paragraph it if you don’t care about why I do the
seemingly weird things I’m describing afterwards.

The basic idea of MRP is that you want to get correct (or “representative”)
estimates of opinion (agree vs. disagree in the simplest case) in a
population using potentially skewed samples. In your sample you have
information on the opinion of interest as well as some demographic
information on each individual (like age, gender, state, etc.) for each
observation. You use the demographic variables as random effects (despite
the perhaps low number of levels) and the binary opinion as the dependent
variable in a mixed model and then predict for all possible combinations of
independent variables (which grows quite fast with the number of
independent variables) the probability of agreement, which you then
multiply with the actual number of people with that set of demographic
features in the population (ground truth which comes from an external
source). You can then aggregate over these cells as you please to get for
example the share of women in each state who agreed or the number of people
over 65 who did not. In general you like to use a full bayesian approach
for this procedure but computational considerations prohibit such a
solution in my case and that’s why I’m using lme4.

So what am I actually doing? I have a large set of categorical
(demographic) variables (most of them with not more than 10 levels) which
might be related to a binary outcome. I perform variable selection using a
kind of sparse group-LASSO to drop some of them completely (if the
coefficients for all categories of the variable are shrunk to 0) and reduce
the number of levels on others (by merging those that were reduced to 0).
By doing so I get a smaller set of categorical predictors which I then use
as random effects in a mixed logit model (utilizing glmer) if they have
more than two levels (else they are excluded as fixed effects, because as
far as I understood Gelman & Hill RE and FE are in fact identical in case
of binary variables). Based on this model I create predictions for all
possible combinations of the independent variables, weigh them by some
value and aggregate them over a selection of the independent variables.

Despite the frequently low number of levels on the random effect-variables
this approach works surprisingly well, at least as long as I am interested
in point estimates. But getting interval estimates without applying rather
expensive bootstrapping techniques is a whole other topic, which leads me
to my question: I use the sim()-function from the arm-package to sample
from the I guess you could call it posterior of each prediction a few
hundred times and aggregate these samples as described to get a rough and
fast impression of the underlying distribution which I can use to estimate
some kind of prediction interval while being aware that I most likely
underestimate the error in my predictions by doing this due to the fact
that the uncertainty of certain parameters is ignored using this method. .

Again this works better then expected but it sometimes causes some weird
idiosyncrasies, especially in combination with the variable selection. As
the LASSO-lambda is determined using cross validation there is always a
small stochastic element in the selection of variables. And that’s the
issue: If a certain level of a predictor which exhibits at most a very weak
relationship to the dependent variable is selected or not sometimes changes
the estimated width of the prediction interval by a very large margin
(200-300%), without influencing the point estimates at all. Something
similar happens using predictInterval() from the merTools-package, so it
isn't just a sim()-related issue.

So does somebody have an idea what might explain this weird behavior and
what to do about it or can point me into directions to further debug it?

Thank you so much!

Tobias

	[[alternative HTML version deleted]]