[R-sig-ME] Controlling for self-selection bias / endogeneity in mixed models

Mon Apr 13 02:36:24 CEST 2020

Hi Kelly,

It sounds like your reasoning is correct on the need for a multilevel
model if your variable of interest is time-invariant.

Can you post a link to the thread you're referencing?

A bit of clarity on the flavor(s) of endogeneity that concern you would be
helpful. The omitted variable bias addressed by group mean centering and
the Mundlak device comes mostly from model mis/underspecification, whereas
sample selection is a fundamentally different mechanism. Both are
recognized as common sources of endogeneity in different pockets of econ,
but other fields tend to see them as fundamentally different (often
conceptually unrelated) problems. Econ subsumes omitted variables, joint
causation, measurement error, and sample selection under the endogeneity
umbrella because they all cause correlation between X and the error; other
fields don't make the same connection. For instance, early panel data work
talked about Mundlak devices as "instruments" in the same way that dynamic
panel data models talk about lags and first differences as instruments, but
they aren't traditional instrumental variables that you'd find in the wild
and arguably wouldn't pass the exclusion restriction outside of panel
data. They are called instruments because they instrument the endogeneity,
but they aren't "instrumental variables" in the common parlance.

It's not clear to me whether you are referring to general omitted variable
bias, whereby you don't have all the appropriate variables in the model, or
to sample selection bias a la Heckman, whereby the sample under study is
systematically different from the population to which you would like to
make inferences and thus needs some kind of "propensity to choose A or B"
correction, as in the standard selection model. I ask specifically because
you referenced the inverse Mills ratio, but it *sounds* like you just think
you may be missing some set of confounders due to the lack of
randomization. If you do have sample selection bias, you can use a
multilevel variant of a Heckman selection model with random effects in the
outcome and selection equations. See Grilli, L., & Rampichini, C. (2010).
Selection bias in linear mixed models. *Metron, 68*(3), 309-329 for the
best discussion of the topic that I've read. Most multilevel modeling work
on this kind of problem is based on multilevel propensity score matching,
which is a close cousin of multilevel Heckman selection models because the
inverse Mills ratio and the propensity score are related.
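
To make that last relationship concrete, here is a minimal sketch. It
assumes a probit selection equation, so the propensity score is the normal
CDF of the linear index and the inverse Mills ratio is the normal PDF
divided by that same CDF. The function names are mine, not from any
package, and it is plain Python rather than R so it runs without installing
anything.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def propensity(index):
    """Probability of selection under an assumed probit selection equation."""
    return normal_cdf(index)


def inverse_mills(index):
    """Heckman-style correction term for the selected sample: pdf / cdf."""
    return normal_pdf(index) / normal_cdf(index)


# Both quantities are functions of the same linear index from the
# selection equation, which is why the two approaches are close cousins.
for index in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(
        f"index={index:+.1f}  "
        f"propensity={propensity(index):.3f}  "
        f"inverse_mills={inverse_mills(index):.3f}"
    )
```

The inverse Mills ratio falls as the propensity score rises: units that
were very likely to select in need little correction, and unlikely ones
need a lot.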

You're right that the addition of group means per Mundlak separates the
within and between effects into two different sets of betas when they would
otherwise be a weighted average. It's just a reparameterization of the
dummy variable version of fixed effects. It is mathematically impossible in
a linear model for a group mean centered multilevel model to return
different within-group beta coefficients than the standard FE model. That
doesn't mean that both of them aren't wrong because of cross-level
interactions, measurement error, selection bias, and whatnot, but they
would both be wrong in identical ways. You can directly test that they are
identical with a version of a Hausman test that compares the within-group
betas with a chi-squared test. The degrees of freedom calculation will be
off from the regular test because the between effects add extra
parameters, but the within effects will be identical to rounding error, so
it really won't matter. You can also just do a Mundlak variation on the
test. All panel data econometrics textbooks outline this, and you can
justify the modeling strategy that way regardless of reviewer
misconceptions.
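
If it helps to see the equivalence numerically, here is a rough sketch on
simulated data. Everything in it is an assumption for illustration: the
data-generating process, the variable names, and the use of pooled OLS with
the group mean added (the Mundlak regression) rather than an actual mixed
model fit. It is plain Python with no packages.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)


def group_means(values, groups):
    """Return each observation's group mean, aligned with the data."""
    totals, counts = {}, {}
    for g, v in zip(groups, values):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [totals[g] / counts[g] for g in groups]


# Simulated unbalanced panel: x is correlated with an unobserved group
# effect u, so a pooled regression of y on x is biased. True slope is 2.
random.seed(1)
groups, x, y = [], [], []
for j in range(30):
    u = random.gauss(0.0, 1.0)
    for _ in range(random.randint(3, 8)):
        xi = 0.8 * u + random.gauss(0.0, 1.0)
        groups.append(j)
        x.append(xi)
        y.append(2.0 * xi + u + random.gauss(0.0, 0.5))

xbar = group_means(x, groups)
ybar = group_means(y, groups)

# Standard FE (within) estimator: regress demeaned y on demeaned x.
xw = [a - b for a, b in zip(x, xbar)]
yw = [a - b for a, b in zip(y, ybar)]
beta_fe = sum(a * b for a, b in zip(xw, yw)) / sum(a * a for a in xw)

# Mundlak regression: y on intercept, x, and the group mean of x.
ones = [1.0] * len(y)
_, beta_mundlak, _ = ols([ones, x, xbar], y)

# Naive pooled regression for comparison.
_, beta_pooled = ols([ones, x], y)

print(f"within (FE) beta: {beta_fe:.6f}")
print(f"Mundlak beta:     {beta_mundlak:.6f}")
print(f"pooled OLS beta:  {beta_pooled:.6f}")
```

The first two numbers should agree to rounding error, while the pooled
estimate drifts away from them because it mixes the within and between
effects.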

If the FE and group mean centered MLM are both wrong and there's some kind
of interactive effect still at work, then a random coefficient will likely
show up as mattering for model fit with something like an LR test. If the
beta for (X_i - Xbar_j) on Y does not vary as a function of group per an LR
test or something fancier like WAIC, then that is reasonable (but not
infallible) evidence that you don't have group heterogeneity-related
omitted variable bias, which is what economists would typically be
concerned about in this context. You can still have other kinds of bias at
work, just like with any other kind of observational model. The random
coefficient in this context is a regularized interactive fixed effect in
econ jargon, whereby you are interacting the grouping structure with
whatever X you want and getting a distribution of effects. Fundamentally,
it's like saying you have some kind of conditional relationship between
group/person and X and just interacting them. It's slightly complicated by
the fact that empirical Bayes shrinkage exists, but if you have balanced
panels then it's mostly a non-issue.
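
Written out, the comparison I have in mind looks roughly like the
following. The notation is my own shorthand rather than anything from your
model or from Bell et al.: x_ij minus xbar_j is the group mean centered
predictor (X_i - Xbar_j above), beta_W and beta_B are the within and
between effects, and u_1j is the random coefficient being tested.

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_W (x_{ij} - \bar{x}_j) + \beta_B \bar{x}_j
          + u_{0j} + u_{1j} (x_{ij} - \bar{x}_j) + e_{ij}, \\
(u_{0j}, u_{1j})^\top &\sim N(0, \Sigma), \qquad e_{ij} \sim N(0, \sigma^2), \\
H_0 &: \operatorname{Var}(u_{1j}) = 0, \qquad
\mathrm{LR} = 2\,\bigl(\ell_{\text{random slope}} - \ell_{\text{random intercept}}\bigr).
\end{aligned}
```

Because the null puts a variance on the boundary of the parameter space,
the usual chi-squared reference distribution tends to be conservative, so
treat a borderline p-value with some caution.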

On Sun, Apr 12, 2020 at 7:34 PM Slaughter, Kelly <KELLY.SLAUGHTER using tcu.edu>
wrote:

> Hi all -
>
> I have a concern regarding self-selection/omitted variable bias. I have a
> longitudinal/repeated measures model, theorizing about a relationship
> between treatment/control and effort, represented in nlme syntax as:
>
> EQ 1) log(effort measured in time) ~ treatment*scale(experience), random =
> ~1|subject
>
> Treatment/control is selected by the subject, it is not randomized, thus
> raising endogeneity concerns. My background is applied econ, so as I learn
> the mixed model domain, I expected to find the mixed model equivalent of
> instrumental variables/inverse Mills ratio, etc. Yet there is surprisingly
> (to me) limited material addressing this issue. The best reference material
> I found is in fact a thread in this mailing list from October 2016 and the
> papers referenced within, leading to Bell, Fairbrother, and Jones (2019).
> My first impression is that I should employ a within-between random effects
> (REWB) model -
>
> EQ 2) log(effort measured in time) ~ treatment*scale(experience) +
> experience_between + experience_within, random = experience_within +
> scale(experience) | subject
>
> If I understand correctly, the intuition is that the addition of a group
> mean explanatory variable "breaks out" the variability that would be
> associated with an omitted variable / error term. Per Bell et al, "there
> can be no correlation between level 1 variables included in the model and
> the level 2 random effects...unchanging and/or unmeasured characteristics
> of an individual (such as intelligence, ability, etc.) will be controlled
> out of the estimate of the within effect."
>
> So, no concern between the subject (level 2) and treatment (level 1) via
> REWB, wonderful!
>
> Bell et al caution, "...in a REWB/Mundlak models, unmeasured level 2
> characteristics can cause bias in the estimates of between effects and
> effects of other level 2 variables."
>
> Not an issue for me - I am not concerned with level 2, I include subject
> to address the IID violation but am interested in population, not subject,
> performance.
>
> Bell et al continue, "However, unobserved time-varying characteristics can
> still cause biases at level 1 in either an FE or a REWB/Mundlak model."
>
> Though conceptually my treatment variable is time-varying (it can change
> across time within a subject), as a practical/empirical matter, the
> treatment is unchanging within the subject - subjects have no reason to
> change / would prefer to keep the choice constant. Of 80k records,
> treatment switches within a subject occur in about a dozen records.
>
> So, I think I have my solution. However, if a reviewer is not happy with
> the within/between REWB solution (worried about the level 1 bias), I can
> further defend EQ 2 via its random coefficient/slope, if I understand the
> Oct 2016 thread correctly.
>
> So, my questions are:
>
> (1) Is the above correctly reasoned?
>
> (2) If the random slope model is a further defense against self-selection
> bias, could someone provide an intuitive explanation as to why? Is the idea
> that by allowing slopes to vary, there is no endogeneity problem to solve
> as the very structure of the model makes the correlated errors concern
> irrelevant?
>
> Other solutions I explored include a Mundlak model, but per Bell et al, the
> Mundlak models are not meaningful for repeated measures. Also, it appears
> that the brms package supports mixed modeling using instrumental
> variables, something I am more comfortable with per my background, but
> strong instrumental variables are hard to find in the wild!
>
> Thank you! - Kelly
>
>
> _______________________________________________
> R-sig-mixed-models using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
