[R-sig-ME] Controlling for self-selection bias / endogeneity in mixed models

Slaughter, Kelly KELLY.SLAUGHTER at tcu.edu
Mon Apr 13 13:53:05 CEST 2020


Thank you, Daniel. 

Yes, I have the time-invariant "treatment" as a level-1/fixed effect, and am further hypothesizing that "treatment" becomes more important as one gains "experience" (hence an interaction term). The variable I am considering de-meaning / group-mean centering is "experience".

I scaled "experience" originally to address convergence issues, but as the R implementation of scaling also centers, I also addressed the collinearity between the main and interaction variables. But I can center without scaling of course. 

Your first link did not work for me, but the general site referenced in the link, as well as the "parameters" links, looks potentially quite helpful. I will review and run your gist to better understand the impact of the within/between decomposition and REWB - thank you very much!

-----Original Message-----
From: Daniel Lüdecke <d.luedecke using uke.de> 
Sent: Monday, April 13, 2020 2:23 AM
To: Slaughter, Kelly <KELLY.SLAUGHTER using tcu.edu>; r-sig-mixed-models using r-project.org
Subject: AW: [R-sig-ME] Controlling for self-selection bias / endogeneity in mixed models

Hi Kelly,

> Not an issue for me - I am not concerned with level 2, I include subject
> to address the IID violation but am interested in population, not subject,
> performance.

If your variable is practically time-constant (time-invariant), you can add it as a normal predictor; you don't need to de-mean and group-mean center it (i.e., separate it into within- and between-effects). In your case, if "treatment" is practically constant over time, you just include it "as is" in your model.
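
As a minimal sketch (reusing your variable names, and assuming the within/between columns for "experience" have already been computed, e.g. with parameters::demean()):

library(nlme)
# the (practically) time-invariant predictor is entered directly; only the
# time-varying predictor is split into its within- and between-components
m <- lme(log(effort) ~ treatment + experience_within + experience_between,
         random = ~ 1 | subject, data = dat)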

The main reason for heterogeneity bias, if I understood Bell et al. correctly, is the weighted average of coefficients for time-varying variables (or more generally: level-1 predictors that also have a level-2 effect and thus may correlate with the group variable from the random effects). Simply decomposing time-varying predictors into their within- and between-effects indeed gives you the same consistent estimates as a "fixed effects" model, except that the REWB model has many more benefits.
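
A small sketch of this equivalence, assuming a data frame "dat" with an outcome y, a time-varying predictor x and a grouping variable subject:

library(lme4)
dat$x_between <- ave(dat$x, dat$subject, FUN = mean)
dat$x_within  <- dat$x - dat$x_between

fe   <- lm(y ~ x + factor(subject), data = dat)                     # "fixed effects" (dummy) estimator
rewb <- lmer(y ~ x_within + x_between + (1 | subject), data = dat)  # REWB model

coef(fe)["x"]            # within-effect from the FE model
fixef(rewb)["x_within"]  # essentially the same estimate from the REWB model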

Based on a short blog post I found
(https://shouldbewriting.netlify.com/posts/2019-10-21-accounting-for-within-and-between-subject-effect/), I have written a small gist that produces plots and coefficient tables for teaching repeated measurements with mixed models, which shows this:
https://gist.github.com/strengejacke/c53e1fa1d7cf41e4737f3ab044a67d09

One thing I would take into consideration is the interaction term. There are several ways to do this when a time-varying predictor is used in an interaction. I would not scale it (as in your example), but rather think about whether you're interested in the interaction with the within-effect or the between-effect (or both). See the 'Details' section in the "parameters::demean()" help for some more explanation and references (https://easystats.github.io/parameters/reference/demean.html).
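
For illustration, something along these lines (only a sketch, here with lme4 and your variable names):

library(parameters)
library(lme4)
# adds experience_within and experience_between columns
dat <- cbind(dat, demean(dat, select = "experience", group = "subject"))

# interaction of treatment with the within-effect only ...
m1 <- lmer(log(effort) ~ treatment * experience_within + experience_between +
             (1 | subject), data = dat)
# ... or with both components, if the between-interaction is also of interest
m2 <- lmer(log(effort) ~ treatment * (experience_within + experience_between) +
             (1 | subject), data = dat)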

To your 2nd question: see my gist above. FE models estimate the within-effect; however, this effect may (and is very likely to) vary between group levels (i.e. subjects). Thus, including the within-effect as a random slope makes sense, since it captures the variability between groups (but leads to larger standard errors, because it better accounts for the uncertainty in the random effects). See also this vignette:
https://easystats.github.io/parameters/articles/demean.html
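
In model syntax, the random-slope variant would look roughly like this (again only a sketch, using the de-meaned variables from above):

# within-effect additionally as random slope, so it may vary across subjects
m_slope <- lmer(log(effort) ~ treatment * experience_within + experience_between +
                  (1 + experience_within | subject), data = dat)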

Best
Daniel

-----Original Message-----
From: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org> On Behalf Of Slaughter, Kelly
Sent: Monday, 13 April 2020 01:34
To: r-sig-mixed-models using r-project.org
Subject: [R-sig-ME] Controlling for self-selection bias / endogeneity in mixed models

Hi all -

I have a concern regarding self-selection/omitted variable bias. I have a longitudinal/repeated measures model, theorizing about a relationship between treatment/control and effort, represented in nlme syntax as:

EQ 1) log(effort measured in time) ~ treatment*scale(experience), random = ~1|subject
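
i.e., in full (a sketch; "dat" and the column names stand in for my actual data):

library(nlme)
m_eq1 <- lme(log(effort) ~ treatment * scale(experience),
             random = ~ 1 | subject, data = dat)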

Treatment/control is selected by the subject rather than randomized, which raises endogeneity concerns. My background is applied econ, so as I learn the mixed-model domain, I expected to find the mixed-model equivalent of instrumental variables, the inverse Mills ratio, etc. Yet there is surprisingly (to me) little material addressing this issue. The best reference material I found is in fact a thread in this mailing list from October 2016 and the papers referenced within, leading to Bell, Fairbrother, and Jones (2019). My first impression is that I should employ a within-between random effects (REWB) model -

EQ 2) log(effort measured in time) ~ treatment*scale(experience) + experience_between + experience_within, random = ~ experience_within + scale(experience) | subject
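
Written out as code, EQ 2 would look roughly like this (a sketch only - group means computed beforehand, and for simplicity only the within-effect is kept as a random slope here):

dat$experience_between <- ave(dat$experience, dat$subject, FUN = mean)
dat$experience_within  <- dat$experience - dat$experience_between

m_eq2 <- lme(log(effort) ~ treatment * scale(experience) +
               experience_between + experience_within,
             random = ~ experience_within | subject, data = dat)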

If I understand correctly, the intuition is that the addition of a group mean explanatory variable "breaks out" the variability that would be associated with an omitted variable / error term. Per Bell et al, "there can be no correlation between level 1 variables included in the model and the level 2 random effects...unchanging and/or unmeasured characteristics of an individual (such as intelligence, ability, etc.) will be controlled out of the estimate of the within effect."

So, no concern between the subject (level 2) and treatment (level 1) via REWB, wonderful!

Bell et al caution, "...in a REWB/Mundlak models, unmeasured level 2 characteristics can cause bias in the estimates of between effects and effects of other level 2 variables."

Not an issue for me - I am not concerned with level 2, I include subject to address the IID violation but am interested in population, not subject, performance.

Bell et al continue, "However, unobserved time-varying characteristics can still cause biases at level 1 in either an FE or a REWB/Mundlak model."

Though conceptually my treatment variable is time-varying (it can change across time within a subject), as a practical/empirical matter, the treatment is unchanging within the subject - subjects have no reason to change / would prefer to keep the choice constant. Of 80k records, treatment switches within a subject occur in about a dozen records.
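
(For what it's worth, this is roughly how I checked that - base R, with placeholder names:)

# number of distinct treatment values per subject
n_trt <- tapply(dat$treatment, dat$subject, function(x) length(unique(x)))
table(n_trt > 1)   # TRUE = subjects that ever switch treatment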

So, I think I have my solution. However, if a reviewer is not happy with the within / between REWB solution (worried about the level-1 bias), I can further defend EQ 2 via its random coefficient/slope, if I understand the Oct 2016 thread correctly.

So, my questions are:

(1) Is the above correctly reasoned?

(2) If the random slope model is a further defense against self-selection bias, could someone provide an intuitive explanation as to why? Is the idea that by allowing slopes to vary, there is no endogeneity problem to solve as the very structure of the model makes the correlated errors concern irrelevant?

Other solutions I have explored include a Mundlak model, but per Bell et al, Mundlak models are not meaningful for repeated measures. Also, the brms package appears to support mixed modeling with instrumental variables, something I am more comfortable with given my background, but strong instruments are hard to find in the wild!
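
The brms approach I have in mind is its multivariate-formula setup with correlated residuals, roughly as sketched below - "instrument" is a hypothetical instrument I do not actually have, and both equations are treated as Gaussian for the sake of the sketch:

library(brms)
f_first  <- bf(treatment ~ instrument + (1 | subject))              # first stage
f_second <- bf(log(effort) ~ treatment * experience + (1 | subject))  # outcome equation
m_iv <- brm(f_first + f_second + set_rescor(TRUE), data = dat)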

Thank you! - Kelly

