[R-sig-ME] Controlling for self-selection bias / endogeneity in mixed models

Mon Apr 13 01:34:02 CEST 2020

Hi all -

I have a concern regarding self-selection/omitted variable bias. I have a longitudinal/repeated measures model, theorizing about a relationship between treatment/control and effort, represented in nlme syntax as:

EQ 1) log(effort measured in time) ~ treatment*scale(experience), random = ~1|subject

Treatment/control is selected by the subject, not randomized, which raises endogeneity concerns. My background is applied econ, so as I learn the mixed model domain, I expected to find the mixed model equivalent of instrumental variables, the inverse Mills ratio, etc. Yet there is surprisingly (to me) limited material addressing this issue. The best reference material I found is in fact a thread on this mailing list from October 2016 and the papers referenced within, leading to Bell, Fairbrother, and Jones (2019). My first impression is that I should employ a within-between random effects (REWB) model:

EQ 2) log(effort measured in time) ~ treatment*scale(experience) + experience_between + experience_within, random = ~ experience_within + scale(experience) | subject

If I understand correctly, the intuition is that the addition of a group mean explanatory variable "breaks out" the variability that would be associated with an omitted variable / error term. Per Bell et al, "there can be no correlation between level 1 variables included in the model and the level 2 random effects...unchanging and/or unmeasured characteristics of an individual (such as intelligence, ability, etc.) will be controlled out of the estimate of the within effect."
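For concreteness, here is a minimal sketch of the within-between split in R with nlme, run on simulated data. Everything in it is an assumption made for illustration: the data, the column names, and the simplified model, which uses a random intercept only and enters experience through its two components rather than alongside scale(experience), since the raw variable is exactly the sum of the two. It is not the model from EQ 2.

```r
library(nlme)

# Simulated stand-in data; all column names and parameter values are hypothetical.
set.seed(1)
n_subj <- 50
n_obs  <- 8
d <- data.frame(
  subject    = factor(rep(seq_len(n_subj), each = n_obs)),
  experience = rexp(n_subj * n_obs, rate = 0.2)
)

# Treatment is held constant within subject in this toy example.
d$treatment <- factor(rep(rbinom(n_subj, 1, 0.5), each = n_obs),
                      levels = c(0, 1), labels = c("control", "treated"))

# Outcome with a subject-level intercept shift and observation-level noise.
subj_effect <- rep(rnorm(n_subj, sd = 0.5), each = n_obs)
d$effort <- exp(1 + 0.3 * (d$treatment == "treated") - 0.05 * d$experience +
                subj_effect + rnorm(nrow(d), sd = 0.3))

# Within-between decomposition: subject mean, and deviation from that mean.
d$experience_between <- ave(d$experience, d$subject)   # ave() defaults to the mean
d$experience_within  <- d$experience - d$experience_between

# Random-intercept REWB-style fit; a random slope could be requested with
# random = ~ experience_within | subject.
fit <- lme(log(effort) ~ treatment * experience_within + experience_between,
           random = ~ 1 | subject, data = d)
fixef(fit)
```

By construction experience_within averages to zero inside every subject, so it cannot be correlated with anything that is constant within a subject; that is the sense in which the within coefficient is protected from time-invariant omitted variables.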

So, no concern between the subject (level 2) and treatment (level 1) via REWB, wonderful!

Bell et al caution, "...in a REWB/Mundlak models, unmeasured level 2 characteristics can cause bias in the estimates of between effects and effects of other level 2 variables."

Not an issue for me - I am not concerned with level 2, I include subject to address the IID violation but am interested in population, not subject, performance.

Bell et al continue, "However, unobserved time-varying characteristics can still cause biases at level 1 in either an FE or a REWB/Mundlak model."

Though conceptually my treatment variable is time-varying (it can change across time within a subject), as a practical/empirical matter, the treatment is unchanging within the subject - subjects have no reason to change / would prefer to keep the choice constant. Of 80k records, treatment switches within a subject occur in about a dozen records.
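A count like the one above can be reproduced with a few lines of R. This is a sketch on a toy data frame with hypothetical column names, and it assumes rows are already in time order within each subject.

```r
# Toy data; column names are hypothetical, rows assumed sorted by time within subject.
d <- data.frame(
  subject   = c(1, 1, 1, 2, 2, 2, 3, 3),
  treatment = c("a", "a", "a", "a", "b", "b", "b", "b")
)

# Number of times treatment differs from the previous record, per subject.
switches <- tapply(d$treatment, d$subject,
                   function(x) sum(x[-1] != x[-length(x)]))

total_switches  <- sum(switches)       # records where a switch occurs
n_switching_ids <- sum(switches > 0)   # subjects who switch at least once
```

Subjects with zero switches contribute no within-subject variation in treatment, so any within-subject estimate of the treatment effect would rest on the few subjects who do switch.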

So, I think I have my solution. However, if a reviewer is not happy with the within-between (REWB) solution and is worried about the level 1 bias, I can further defend EQ 2 via its random coefficient/slope, if I understand the Oct 2016 thread correctly.

So, my questions are:

(1) Is the above correctly reasoned?

(2) If the random slope model is a further defense against self-selection bias, could someone provide an intuitive explanation as to why? Is the idea that by allowing slopes to vary, there is no endogeneity problem to solve as the very structure of the model makes the correlated errors concern irrelevant?

Other solutions I explored include a Mundlak model, but per Bell et al, Mundlak models are not meaningful for repeated measures. Also, the brms package appears to support mixed modeling using instrumental variables, something I am more comfortable with given my background, but strong instrumental variables are hard to find in the wild!

Thank you! - Kelly
