[R-sig-ME] Controlling for self-selection bias / endogeneity in mixed models

Mon Apr 13 02:45:45 CEST 2020

  Wow, this is the kind of content I come here for.  (It will take me
a while to digest this ...) Thank you!

On Sun, Apr 12, 2020 at 8:36 PM John Poe <jdpoe223 using gmail.com> wrote:
>
> Hi Kelly,
>
> It sounds like you've got correct reasoning on the need for a multilevel
> model if your variable of interest is time invariant.
>
> Can you post a link to the thread you're referencing?
>
> A bit of clarity on the flavor(s) of endogeneity that concern you might be
> helpful. The omitted variable bias issues solved by group mean centering
> and the Mundlak device are mostly from model mis/underspecification whereas
> sample selection is a fundamentally different mechanism. Both are common
> sources of endogeneity recognized as such in different pockets of econ but
> they tend to be seen as fundamentally different (often conceptually
> unrelated) problems in other fields. Econ subsumes omitted variables, joint
> causation, measurement error, and sample selection under the endogeneity
> umbrella because they all cause correlation between X and the error but
> other fields don't make the same connection. For instance, early panel data
> work talked about Mundlak devices as "instruments" in the same way that
> dynamic panel data models talk about lags and first differences as
> instruments but they aren't traditional instrumental variables that you'd
> find in the wild and arguably wouldn't pass the exclusion restriction test
> outside of panel data. They call them instruments because they instrument
> the endogeneity but they aren't "instrumental variables" in the common
> parlance.
>
> It's not clear to me whether you are referring to general omitted variable
> bias, whereby you don't have all the appropriate variables in the model, or
> to sample selection bias a la Heckman, whereby the sample under study is
> systematically different from the population to which you would like to
> make inferences and thus needs some kind of complex "propensity to choose A
> or B" style correction, as in the standard selection model. I ask because
> you referenced the inverse Mills ratio, but it *sounds* like you just think
> you may be missing some set of confounders due to the lack of
> randomization. If you do have sample selection bias you can use a
> multilevel variant of a Heckman selection model with random effects in the
> outcome and selection equations. See Grilli, L., & Rampichini, C. (2010).
> Selection bias in linear mixed models. *Metron, 68*(3), 309-329 for the
> best discussion of the topic that I've read. Most multilevel modeling work
> with this kind of problem is based on multilevel propensity score matching,
> which is a close cousin of multilevel Heckman selection models, as the
> inverse Mills ratio and the propensity score are related.
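
For readers less familiar with the inverse Mills ratio mentioned above, here
is a minimal single-level sketch of the classic Heckman two-step idea in base
R. It is not the multilevel selection model from Grilli and Rampichini. The
simulated data, the variable names (x, z, selected, imr), and the error
correlation of 0.7 are all assumptions made up for illustration.

```r
set.seed(42)
n <- 5000
x <- rnorm(n)   # covariate in both the selection and outcome equations
z <- rnorm(n)   # assumed exclusion restriction: affects selection only

# Correlated errors (correlation 0.7) are what create the selection bias
e_sel <- rnorm(n)
e_out <- 0.7 * e_sel + sqrt(1 - 0.7^2) * rnorm(n)

selected <- (0.5 + x + z + e_sel) > 0   # outcome observed only if TRUE
y <- 1 + 2 * x + e_out                  # true slope on x is 2
y[!selected] <- NA
d <- data.frame(y, x, z, selected)

# Step 1: probit for selection, then the inverse Mills ratio from its index
probit <- glm(selected ~ x + z, family = binomial(link = "probit"), data = d)
xb <- predict(probit, type = "link")
d$imr <- dnorm(xb) / pnorm(xb)

# Step 2: outcome regression on the selected sample, without and with the IMR
naive  <- lm(y ~ x, data = d, subset = selected)
heckit <- lm(y ~ x + imr, data = d, subset = selected)

rbind(naive = coef(naive)["x"], two_step = coef(heckit)["x"])
```

The standard errors that lm() reports for the second step are not corrected
for the estimated IMR, so this sketch is only meant to show the mechanics.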
>
> You're right that the addition of group means per Mundlak segregates the
> within and between effects into two different sets of betas when they would
> otherwise be a weighted average. It's just a reparameterization of the
> dummy variable version of fixed effects. It is mathematically impossible in
> a linear model for a group mean centered multilevel model to return
> different within group beta coefficients than the standard FE model. That
> doesn't mean that both of them aren't wrong because of cross-level
> interactions, measurement error, selection bias, and whatnot, but they
> would both be wrong in identical ways. You can directly test that they are
> identical with a version of a Hausman test comparing the within-group betas
> with a chi2 test. The degrees of freedom calculation will be off from the
> regular test because the between effects add extra parameters, but the
> within effects will be identical up to rounding error so it really won't
> matter. You can also
> just do a Mundlak variation on the test. All panel data econometrics
> textbooks outline this and you can justify the modeling strategy that way
> regardless of reviewer misconceptions.
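
The equivalence described above can be checked numerically. The following is
a minimal sketch on simulated data, assuming a single predictor x that is
correlated with an unobserved subject effect. The data, the names x_within
and x_between, and the true slope of 0.5 are all made up for illustration.

```r
library(nlme)  # ships with standard R installations

set.seed(1)
n_subj <- 50
n_obs  <- 8
id <- rep(seq_len(n_subj), each = n_obs)

u <- rnorm(n_subj)[id]                 # unobserved subject effect
x <- 0.8 * u + rnorm(n_subj * n_obs)   # x is correlated with that effect
y <- 1 + 0.5 * x + u + rnorm(n_subj * n_obs)

d <- data.frame(subject = factor(id), x = x, y = y)
d$x_between <- ave(d$x, d$subject)     # subject mean of x
d$x_within  <- d$x - d$x_between       # deviation from the subject mean

# Dummy-variable fixed effects versus group-mean-centered random intercept
fe   <- lm(y ~ x + subject, data = d)
rewb <- lme(y ~ x_within + x_between, random = ~ 1 | subject, data = d)

c(fe_dummy    = unname(coef(fe)["x"]),
  rewb_within = unname(fixef(rewb)["x_within"]))
```

The two printed coefficients should agree up to rounding error, while the
coefficient on x_between is free to differ from them.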
>
> If the FE or group mean centered MLM are both wrong and there's some kind
> of interactive effect still at work then a random coefficient will likely
> show up as mattering for model fit with something like an LR test. If the
> coefficient on the within-group deviation (X_ij - Xbar_j) does not vary as
> a function of group per an LR test or something fancier like
> WAIC, then it is reasonable (but not infallible)
> evidence that you don't have group heterogeneity-related omitted variable
> bias which is what economists would typically be concerned about in this
> context. You can still have other kinds of bias at work just like with any
> other kind of observational model. The random coefficient in this context
> is a regularized interactive fixed effect in econ jargon whereby you are
> interacting the grouping structure with whatever X you want and getting a
> distribution of effects. Fundamentally, it's like saying you have some kind
> of conditional relationship between group/person and X and just interacting
> them. It's slightly complicated by the fact that empirical Bayes shrinkage
> exists but if you have balanced panels then it's mostly a non-issue.
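
The LR test on a random coefficient described above could be sketched as
follows with nlme. The data are simulated with subject-specific slopes on
purpose, and the names (x_within, x_between, m0, m1) and all parameter values
are assumptions for illustration, not anything taken from the original data.

```r
library(nlme)

set.seed(2020)
n_subj <- 100
n_obs  <- 10
id <- rep(seq_len(n_subj), each = n_obs)

b0 <- rnorm(n_subj, sd = 1)[id]     # subject-specific intercept deviations
b1 <- rnorm(n_subj, sd = 0.5)[id]   # subject-specific slope deviations

d <- data.frame(subject = factor(id), x = rnorm(n_subj * n_obs))
d$x_between <- ave(d$x, d$subject)
d$x_within  <- d$x - d$x_between
d$y <- 1 + b0 + (0.5 + b1) * d$x_within + 0.3 * d$x_between + rnorm(nrow(d))

# Same fixed effects in both fits, so comparing REML likelihoods is legitimate
m0 <- lme(y ~ x_within + x_between, random = ~ 1 | subject,
          data = d, method = "REML")
m1 <- lme(y ~ x_within + x_between, random = ~ x_within | subject,
          data = d, method = "REML")

lrt <- anova(m0, m1)   # likelihood ratio test for the random slope
print(lrt)
```

Because a variance is being tested on the boundary of its parameter space,
the p-value that anova() reports tends to be conservative.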
>
>
>
> On Sun, Apr 12, 2020 at 7:34 PM Slaughter, Kelly <KELLY.SLAUGHTER using tcu.edu>
> wrote:
>
> > Hi all -
> >
> > I have a concern regarding self-selection/omitted variable bias. I have a
> > longitudinal/repeated measures model, theorizing about a relationship
> > between treatment/control and effort, represented in nlme syntax as:
> >
> > EQ 1) log(effort measured in time) ~ treatment*scale(experience), random =
> > ~1|subject
> >
> > Treatment/control is selected by the subject rather than randomized, thus
> > raising endogeneity concerns. My background is applied econ, so as I learn
> > the mixed model domain, I expected to find the mixed model equivalent of
> > instrumental variables/inverse Mills ratio, etc. Yet there is surprisingly
> > (to me) limited material addressing this issue. The best reference material
> > I found is in fact a thread in this mailing list from October 2016 and the
> > papers referenced within, leading to Bell, Fairbrother, and Jones (2019).
> > My first impression is that I should employ a within-between random effects
> > (REWB) model:
> >
> > EQ 2) log(effort measured in time) ~ treatment*scale(experience) +
> > experience_between + experience_within, random = ~ experience_within +
> > scale(experience) | subject
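
A small base-R sketch of how terms like experience_within and
experience_between could be built, using made-up data (20 hypothetical
subjects with 6 records each). It also checks one thing that may be worth
verifying in any specification of this kind: experience is the sum of its
within and between parts, so a fixed-effects design that contains scaled
experience together with both parts is rank deficient.

```r
set.seed(1)
# Hypothetical long-format data: 20 subjects, 6 records each
d <- data.frame(
  subject    = factor(rep(1:20, each = 6)),
  experience = rexp(120, rate = 0.1)
)

d$experience_between <- ave(d$experience, d$subject)         # subject mean
d$experience_within  <- d$experience - d$experience_between  # deviation from it

# The within part averages to zero inside every subject
max_within_mean <- max(abs(tapply(d$experience_within, d$subject, mean)))

# Intercept + scaled experience + both parts: four columns, one is redundant
X <- model.matrix(~ scale(experience) + experience_between + experience_within,
                  data = d)
c(columns = ncol(X), rank = qr(X)$rank)
```

If the reported rank is smaller than the number of columns, one common choice
is to keep the within and between parts and drop the raw (or scaled)
variable, though which terms to keep is a modeling decision.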
> >
> > If I understand correctly, the intuition is that the addition of a group
> > mean explanatory variable "breaks out" the variability that would be
> > associated with an omitted variable / error term. Per Bell et al, "there
> > can be no correlation between level 1 variables included in the model and
> > the level 2 random effects...unchanging and/or unmeasured characteristics
> > of an individual (such as intelligence, ability, etc.) will be controlled
> > out of the estimate of the within effect."
> >
> > So, no concern between the subject (level 2) and treatment (level 1) via
> > REWB, wonderful!
> >
> > Bell et al caution, "...in a REWB/Mundlak models, unmeasured level 2
> > characteristics can cause bias in the estimates of between effects and
> > effects of other level 2 variables."
> >
> > Not an issue for me - I am not concerned with level 2, I include subject
> > to address the IID violation but am interested in population, not subject,
> > performance.
> >
> > Bell et al continue, "However, unobserved time-varying characteristics can
> > still cause biases at level 1 in either an FE or a REWB/Mundlak model."
> >
> > Though conceptually my treatment variable is time-varying (it can change
> > across time within a subject), as a practical/empirical matter, the
> > treatment is unchanging within the subject - subjects have no reason to
> > change / would prefer to keep the choice constant. Of 80k records,
> > treatment switches within a subject occur in about a dozen records.
> >
> > So, I think I have my solution. However, if a reviewer is not happy with
> > the within-between REWB solution (worried about the level 1 bias), I can
> > further defend EQ 2 via its random coefficient/slope, if I understand the
> > Oct 2016 thread correctly.
> >
> > So, my questions are:
> >
> > (1) Is the above correctly reasoned?
> >
> > (2) If the random slope model is a further defense against self-selection
> > bias, could someone provide an intuitive explanation as to why? Is the idea
> > that by allowing slopes to vary, there is no endogeneity problem to solve
> > as the very structure of the model makes the correlated errors concern
> > irrelevant?
> >
> > Other solutions I explored include a Mundlak model, but per Bell et al, the
> > Mundlak models are not meaningful for repeated measures. Also, the brms
> > package appears to support mixed modeling using instrumental variables,
> > something I am more comfortable with per my background, but strong
> > instrumental variables are hard to find in the wild!
> >
> > Thank you! - Kelly
> >
> >
> >         [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-mixed-models using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >