[R-sig-ME] Controlling for self-selection bias / endogeneity in mixed models

Mon Apr 13 03:21:30 CEST 2020

Ah, okay I see the problem now. This kind of multilevel causal inference
problem is a bit hard for me to conceptualize. I usually think about them
with DAGs.

I *think* you're going to end up trying to model the selection mechanism
itself via something like propensity score weighting unless you can find a
good natural IV. In this context the propensity score is an artificial
instrumental variable (much like randomization is an instrument). You can
find a good explanation of inverse probability weighting (IPW) in Hernán
and Robins
https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/ which
includes some detail on longitudinal models, though that material is geared
to time-varying treatments. Since the choice never changes, I think you'll
just be building a propensity score at the time of the choice, which
simplifies things down to the first cross-section of data. I'm familiar
with 15 or 20ish papers on multilevel propensity score modeling, so they
are easy to find. One that you might look at is Arpino, B. and Mealli, F.,
2011. The specification of the propensity score in multilevel observational
studies. *Computational Statistics & Data Analysis*, *55*(4),
pp.1770-1780. Arpino has several papers on the topic, including a
Statistics in Medicine article that's also pretty good. Causal
identification is going to depend on how good the propensity score is and
there's no real way around that. Once you have the weighted (or matched, if
you want to go that route) data you can put it in a regular multilevel
model.
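
A minimal sketch of the weighting step on simulated data, in base R. It
assumes selection into treatment depends only on measured subject-level
covariates; the confounder `z`, the effect sizes, and every variable name
here are hypothetical and only for illustration. It builds the propensity
score on one row per subject, forms stabilized weights, and checks balance.

```r
set.seed(2020)
n_subj <- 2000

# Hypothetical measured subject-level confounder (illustration only)
z <- rnorm(n_subj)
# Self-selected treatment: multiple accounts (1) vs single account (0)
treat <- rbinom(n_subj, 1, plogis(0.8 * z))
# Subject-level outcome; the simulated true treatment effect is -0.3
true_effect <- -0.3
y <- 2 + true_effect * treat - 0.5 * z + rnorm(n_subj, sd = 0.5)
first_wave <- data.frame(y, treat, z)

# Step 1: propensity score from the first cross-section (one row per subject)
ps_fit <- glm(treat ~ z, family = binomial, data = first_wave)
first_wave$ps <- fitted(ps_fit)

# Step 2: stabilized inverse probability weights
p_treat <- mean(first_wave$treat)
first_wave$w <- with(first_wave,
                     ifelse(treat == 1, p_treat / ps, (1 - p_treat) / (1 - ps)))

# Weighted treated-minus-control difference in means
wdiff <- function(v, trt, w) {
  weighted.mean(v[trt == 1], w[trt == 1]) - weighted.mean(v[trt == 0], w[trt == 0])
}
ones <- rep(1, n_subj)

# Step 3: covariate balance before and after weighting
balance_raw <- with(first_wave, wdiff(z, treat, ones))
balance_ipw <- with(first_wave, wdiff(z, treat, w))

# Step 4: naive vs weighted treatment contrast
est_naive <- with(first_wave, wdiff(y, treat, ones))
est_ipw   <- with(first_wave, wdiff(y, treat, w))

print(round(c(balance_raw = balance_raw, balance_ipw = balance_ipw,
              est_naive = est_naive, est_ipw = est_ipw), 3))
```

The weights would then be merged back onto the long data by subject. One
caution: the `weights` argument of nlme's `lme()` describes a residual
variance function, not sampling weights, so it is worth checking how your
multilevel software treats weights before relying on it.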

It's possible that you could model this with cross-level interactions
between ownership and all the level 1 stuff in the model but that would get
messy. I think the propensity score route is at least more straightforward
to interpret. If you had pre-treatment outcome data of some kind then you
could do something like a synthetic control method but I don't know if
that's feasible with what you've got.

On Sun, Apr 12, 2020 at 8:56 PM Slaughter, Kelly <KELLY.SLAUGHTER using tcu.edu>
wrote:

> Thanks for the extensive reply, John! Before I attempt to absorb it all,
> let me offer a couple of quick answers to your questions just to be sure
> the thread does not spiral in multiple directions :)
>
> (1)     The beginning of the thread I reference can be found here:
> https://hypatia.math.ethz.ch/pipermail/r-sig-mixed-models/2016q4/025147.html
>
> (2)     I am referring to omitted variable bias, sorry for the confusion.
> My treatment / control is ownership of multiple financial accounts /
> ownership of single accounts. So perhaps let's say IQ tends to make someone
> more likely to hold multiple accounts (treatment) AND allows them to expend
> less effort in researching financial trades (outcome variable), whereas I
> am theorizing that multiple accounts themselves reduce effort directly.
>
> BTW, Ben, thank you for your extensive support across multiple sites in
> helping the general public with mixed models in R. I have relied upon an
> EXTENSIVE number of your answers to mixed model questions when developing
> my models.
>
> -----Original Message-----
> From: Ben Bolker <bbolker using gmail.com>
> Sent: Sunday, April 12, 2020 7:46 PM
> To: John Poe <jdpoe223 using gmail.com>
> Cc: Slaughter, Kelly <KELLY.SLAUGHTER using tcu.edu>;
> r-sig-mixed-models using r-project.org
> Subject: Re: [R-sig-ME] Controlling for self-selection bias / endogeneity
> in mixed models
>
>   Wow, this is the kind of content I come here for.  (It will take me a
> while to digest this ...) Thank you!
>
> On Sun, Apr 12, 2020 at 8:36 PM John Poe <jdpoe223 using gmail.com> wrote:
> >
> > Hi Kelly,
> >
> > It sounds like you've got correct reasoning on the need for a
> > multilevel model if your variable of interest is time invariant.
> >
> > Can you post a link to the thread you're referencing?
> >
> > A bit of clarity on the flavor(s) of endogeneity that concern you
> > might be helpful. The omitted variable bias issues solved by group
> > mean centering and the Mundlak device are mostly from model
> > mis/underspecification whereas sample selection is a fundamentally
> > different mechanism. Both are common sources of endogeneity recognized
> > as such in different pockets of econ but they tend to be seen as
> > fundamentally different (often conceptually unrelated) problems in
> > other fields. Econ subsumes omitted variables,
> > joint causation, measurement error, and sample selection under the
> > endogeneity umbrella because they all cause correlation between X and
> > the error but other fields don't make the same connection. For
> > instance, early panel data work talked about Mundlak devices as
> > "instruments" in the same way that dynamic panel data models talk
> > about lags and first differences as instruments but they aren't
> > traditional instrumental variables that you'd find in the wild and
> > arguably wouldn't pass the exclusion restriction test outside of panel
> > data. They call them instruments because they instrument the
> > endogeneity but they aren't "instrumental variables" in the common
> > parlance.
> >
> > It's not clear to me if you are referring to general omitted variable
> > bias whereby you don't have all the appropriate variables in the model
> > or sample selection bias a la Heckman whereby the sample under study
> > is systematically different from the population to which you would
> > like to make inferences and thus needs some kind of complex propensity
> > to choose A or B style correction like with the standard selection
> > model. I'm not clear specifically because you referenced the inverse
> > Mills ratio but it *sounds* like you just think you are possibly
> > missing some set of confounders due to the lack of randomization. If
> > you do have sample selection bias you can use a multilevel variant of
> > a Heckman selection model with random effects in the outcome and
> > selection equations. See Grilli, L., & Rampichini, C. (2010).
> > Selection bias in linear mixed models. *Metron, 68*(3), 309-329 for
> > the best discussion of the topic that I've read. Most multilevel
> > modeling work with this kind of problem is based on multilevel
> > propensity score matching which is a close cousin of multilevel
> > Heckman selection models as the inverse Mills ratio and the
> > propensity score are related.
> >
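For orientation, here is a rough single-level sketch of the classic
two-step selection correction on simulated data, in base R. It is not the
multilevel variant from Grilli and Rampichini (no random effects), and the
exclusion variable `z2`, the error correlation, and all names are
hypothetical. It only shows where the inverse Mills ratio comes from and
how it enters the outcome equation.

```r
set.seed(11)
n <- 5000
x  <- rnorm(n)   # covariate in both equations
z2 <- rnorm(n)   # hypothetical exclusion restriction: affects selection only
v  <- rnorm(n)   # selection-equation error
rho <- 0.7
e  <- rho * v + sqrt(1 - rho^2) * rnorm(n)   # outcome error, correlated with v

# A unit enters the sample only when the latent index is positive
selected <- as.integer(0.5 + x + z2 + v > 0)
true_beta <- 1
y <- 1 + true_beta * x + e
d <- data.frame(y, x, z2, selected)

# Step 1: probit selection model fitted on everyone
probit_fit <- glm(selected ~ x + z2, family = binomial(link = "probit"),
                  data = d)
lp <- predict(probit_fit, type = "link")
d$imr <- dnorm(lp) / pnorm(lp)   # inverse Mills ratio for selected units

# Step 2: outcome regression on the selected sample only
naive_fit <- lm(y ~ x, data = d, subset = selected == 1)
heck_fit  <- lm(y ~ x + imr, data = d, subset = selected == 1)

beta_naive <- unname(coef(naive_fit)["x"])
beta_heck  <- unname(coef(heck_fit)["x"])
imr_coef   <- unname(coef(heck_fit)["imr"])
print(round(c(naive = beta_naive, two_step = beta_heck, imr = imr_coef), 3))
```

The second-step standard errors from `lm()` are not corrected for the
estimated inverse Mills ratio, so treat this as a teaching device rather
than something to report.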
> > You're right that the addition of group means per Mundlak segregates
> > the within and between effects into two different sets of betas when
> > they would otherwise be a weighted average. It's just a
> > reparameterization of the dummy variable version of fixed effects. It
> > is mathematically impossible in a linear model for a group mean
> > centered multilevel model to return different within group beta
> > coefficients than the standard FE model. That doesn't mean that both
> > of them aren't wrong because of cross-level interactions, measurement
> > error, selection bias and what not but they would both be wrong in
> > identical ways. You can directly test that they are identical with a
> > version of a Hausman test comparing the within group betas with a chi2
> > test. The degrees of freedom calculation will be off from the regular
> > test because the between effects add extra but the within effects will
> > be identical to rounding error so it really won't matter. You can also
> > just do a Mundlak variation on the test. All panel data econometrics
> > textbooks outline this and you can justify the modeling strategy that
> > way regardless of reviewer misconceptions.
> >
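The equivalence described above can be checked numerically. This is a
small sketch on simulated data, assuming a linear model and a single
level-1 predictor; the data-generating values and variable names are made
up. It compares the dummy-variable FE slope with the within slope from a
group-mean-centered random intercept model and from the Mundlak form.

```r
library(nlme)

set.seed(42)
n_subj <- 50
n_obs  <- 8
subject <- factor(rep(seq_len(n_subj), each = n_obs))
u <- rep(rnorm(n_subj), each = n_obs)          # unobserved subject effect
x <- 0.7 * u + rnorm(n_subj * n_obs)           # x is correlated with u
y <- 1 + 0.5 * x + u + rnorm(n_subj * n_obs, sd = 0.5)

x_between <- ave(x, subject)                   # subject means of x
x_within  <- x - x_between                     # deviations from subject means
d <- data.frame(y, x, x_within, x_between, subject)

# Dummy-variable fixed effects
fe_fit <- lm(y ~ x + subject, data = d)
# Plain random intercept model (no group means)
re_fit <- lme(y ~ x, random = ~ 1 | subject, data = d)
# Mundlak form: raw x plus its subject mean
mundlak_fit <- lme(y ~ x + x_between, random = ~ 1 | subject, data = d)
# Within-between form: centered x plus its subject mean
rewb_fit <- lme(y ~ x_within + x_between, random = ~ 1 | subject, data = d)

beta_fe      <- unname(coef(fe_fit)["x"])
beta_re      <- unname(fixef(re_fit)["x"])
beta_mundlak <- unname(fixef(mundlak_fit)["x"])
beta_rewb    <- unname(fixef(rewb_fit)["x_within"])

print(c(fe = beta_fe, re = beta_re, mundlak = beta_mundlak, rewb = beta_rewb))
```

The FE, Mundlak, and within-between slopes should agree to rounding error,
while the plain random intercept slope drifts toward the between effect.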
> > If the FE or group mean centered MLM are both wrong and there's some
> > kind of interactive effect still at work then a random coefficient
> > will likely show up as mattering for model fit with something like a
> > likelihood ratio (LR) test. If the beta for (X_i - Xbar_j) on Y does
> > not vary as a function of group per an LR test or something fancier
> > like WAIC then it is reasonable (but not
> > infallible) evidence that you don't have group heterogeneity-related
> > omitted variable bias which is what economists would typically be
> > concerned about in this context. You can still have other kinds of
> > bias at work just like with any other kind of observational model. The
> > random coefficient in this context is a regularized interactive fixed
> > effect in econ jargon whereby you are interacting the grouping
> > structure with whatever X you want and getting a distribution of
> > effects. Fundamentally, it's like saying you have some kind of
> > conditional relationship between group/person and X and just
> > interacting them. It's slightly complicated by the fact that empirical
> > Bayes shrinkage exists but if you have balanced panels then it's mostly
> > a non-issue.
> >
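A sketch of the LR comparison described above, using nlme on simulated
data where the within effect really does vary by subject. The sample
sizes, variances, and names are assumptions for illustration. Both models
share the same fixed effects, so comparing REML fits is legitimate here.

```r
library(nlme)

set.seed(7)
n_subj <- 60
n_obs  <- 10
subject <- factor(rep(seq_len(n_subj), each = n_obs))
u0 <- rep(rnorm(n_subj, sd = 1.0), each = n_obs)   # random intercepts
u1 <- rep(rnorm(n_subj, sd = 0.5), each = n_obs)   # subject-specific slope shifts
x  <- rnorm(n_subj * n_obs)
x_between <- ave(x, subject)
x_within  <- x - x_between
y <- 1 + (0.5 + u1) * x_within + 0.5 * x_between + u0 +
  rnorm(n_subj * n_obs, sd = 0.5)
d <- data.frame(y, x_within, x_between, subject)

# Random intercept only
m_int <- lme(y ~ x_within + x_between, random = ~ 1 | subject,
             data = d, method = "REML")
# Random intercept plus random coefficient on the within component
m_slope <- lme(y ~ x_within + x_between, random = ~ 1 + x_within | subject,
               data = d, method = "REML")

# Built-in comparison table
print(anova(m_int, m_slope))

# The same LR statistic by hand: two extra parameters
# (slope variance and intercept-slope covariance)
lr_stat <- as.numeric(2 * (logLik(m_slope) - logLik(m_int)))
p_value <- pchisq(lr_stat, df = 2, lower.tail = FALSE)
print(c(lr_stat = lr_stat, p_value = p_value))
```

Because the slope variance is tested on the boundary of its parameter
space, the chi-square p-value tends to be conservative, so a borderline
result deserves a second look.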
> >
> >
> > On Sun, Apr 12, 2020 at 7:34 PM Slaughter, Kelly
> > <KELLY.SLAUGHTER using tcu.edu>
> > wrote:
> >
> > > Hi all -
> > >
> > > I have a concern regarding self-selection/omitted variable bias. I
> > > have a longitudinal/repeated measures model, theorizing about a
> > > relationship between treatment/control and effort, represented in
> > > nlme syntax as:
> > >
> > > EQ 1) log(effort measured in time) ~ treatment*scale(experience),
> > > random = ~1|subject
> > >
> > > Treatment/control is selected by the subject, it is not randomized,
> > > thus raising endogeneity concerns. My background is applied econ, so
> > > as I learn the mixed model domain, I expected to find the mixed
> > > model equivalent of instrumental variables/inverse Mills ratio, etc.
> > > Yet there is surprisingly (to me) limited material addressing this
> > > issue. The best reference material I found is in fact a thread in
> > > this mailing list from October 2016 and the papers referenced
> > > within, leading to Bell, Fairbrother, and Jones (2019).
> > > My first impression is that I should employ a within-between random
> > > effects (REWB) model -
> > >
> > > EQ 2) log(effort measured in time) ~ treatment*scale(experience) +
> > > experience_between + experience_within, random = ~ experience_within +
> > > scale(experience) | subject
> > >
> > > If I understand correctly, the intuition is that the addition of a
> > > group mean explanatory variable "breaks out" the variability that
> > > would be associated with an omitted variable / error term. Per Bell
> > > et al, "there can be no correlation between level 1 variables
> > > included in the model and the level 2 random effects...unchanging
> > > and/or unmeasured characteristics of an individual (such as
> > > intelligence, ability, etc.) will be controlled out of the estimate
> > > of the within effect."
> > >
> > > So, no concern between the subject (level 2) and treatment (level 1)
> > > via REWB, wonderful!
> > >
> > > Bell et al caution, "...in a REWB/Mundlak models, unmeasured level 2
> > > characteristics can cause bias in the estimates of between effects
> > > and effects of other level 2 variables."
> > >
> > > Not an issue for me - I am not concerned with level 2, I include
> > > subject to address the IID violation but am interested in
> > > population, not subject, performance.
> > >
> > > Bell et al continue, "However, unobserved time-varying
> > > characteristics can still cause biases at level 1 in either an FE or
> > > a REWB/Mundlak model."
> > >
> > > Though conceptually my treatment variable is time-varying (it can
> > > change across time within a subject), as a practical/empirical
> > > matter, the treatment is unchanging within the subject - subjects
> > > have no reason to change / would prefer to keep the choice constant.
> > > Of 80k records, treatment switches within a subject occur in about a
> > > dozen records.
> > >
> > > So, I think I have my solution. However, if a reviewer is not happy
> > > with the within-between REWB solution (worried about the level 1
> > > bias), I can further defend EQ 2 via its random coefficient/slope,
> > > if I understand the Oct 2016 thread correctly.
> > >
> > > So, my questions are:
> > >
> > > (1) Is the above correctly reasoned?
> > >
> > > (2) If the random slope model is a further defense against
> > > self-selection bias, could someone provide an intuitive explanation
> > > as to why? Is the idea that by allowing slopes to vary, there is no
> > > endogeneity problem to solve as the very structure of the model
> > > makes the correlated errors concern irrelevant?
> > >
> > > Other solutions I explore include a Mundlak model, but per Bell et
> > > al, the Mundlak models are not meaningful for repeated measures.
> > > Also, the brms package appears to support mixed modeling using
> > > instrumental variables, something I am more comfortable with per my
> > > background, but strong instrumental variables are hard to find in the
> > > wild!
> > >
> > > Thank you! - Kelly
> > >
> > >
> > >         [[alternative HTML version deleted]]
> > >
> > > _______________________________________________
> > > R-sig-mixed-models using r-project.org mailing list
> > > https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_ma
> > > ilman_listinfo_r-2Dsig-2Dmixed-2Dmodels&d=DwIBaQ&c=7Q-FWLBTAxn3T_E3H
> > > WrzGYJrC4RvUoWDrzTlitGRH_A&r=t-hV_EQcvMxUUCFqXmGPFL3N6XmAH6-xWI5Xpn-
> > > HlYI&m=QIwJJAou0NQyfk892Wz-BodAH5I2A4aX08LX_ruukNk&s=4wSiK6P7-7_81bm
> > > iLGX2F07zLv-M28Gd-4vDdwHogyk&e=
> > >