[R-sig-ME] Dealing with NAs in LMER with longitudinal data (Re Crime and Education data)

Mon Sep 16 05:27:00 CEST 2019

I’ve often heard that mixed-effect lmer/glmer models “handle” or “deal” with NA values well, and I’ve become curious about what this actually means, if it is indeed true. What I’ve observed working with mixed-effect models is that na.omit deletes every row that contains an NA, and, depending on how many rows are dropped, the AIC can decrease dramatically and deceptively, simply because the model is fitted to a smaller sample.
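To make the row-dropping concrete, here is a minimal sketch. It uses lm() on simulated data so that it runs in base R; the data frame `d` and its columns are made up and are not from the crime/ed dataset. It assumes the factory-fresh setting getOption("na.action") == "na.omit", which, as far as I understand, is also what lmer picks up when na.action is not supplied.

```r
# Simulated data; all names are hypothetical.
set.seed(1)
d <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
d$x2[1:40] <- NA                          # 40 rows have a missing predictor

fit_small <- lm(y ~ x1, data = d)         # x2 not in the model: all 100 rows used
fit_big   <- lm(y ~ x1 + x2, data = d)    # rows with NA in x2 are dropped silently

n_small <- nobs(fit_small)                # number of rows actually used
n_big   <- nobs(fit_big)

# AIC is only comparable between models fitted to the same rows,
# so refit the smaller model on the complete cases before comparing.
d_cc <- na.omit(d)
fit_small_cc <- lm(y ~ x1, data = d_cc)
aic_same_rows <- c(small = AIC(fit_small_cc), big = AIC(fit_big))
```

Comparing AIC(fit_small) with AIC(fit_big) directly would mix a 100-row likelihood with a 60-row one, which is the deceptive drop described above.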

I know that one can also use “na.pass” (maybe this is what I’ve heard in the past with regard to lmer handling NAs well?), though I’ve found that it doesn’t always work; it sometimes throws the error ```Error in qr.default(X, tol = tol, LAPACK = FALSE) : NA/NaN/Inf in foreign function call (arg 1)```. When it does work, I’m not sure how it works. I looked through the lme4 manual and the “Fitting Linear Mixed-Effects Models” article, but I couldn’t find anything on this.
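One plausible reading of that error, sketched in base R on simulated data (the names `d`, `mf` and `X` are made up, and this is not lme4's internal code): na.pass leaves the NA rows in the model frame, so the NAs reach the design matrix, and a QR decomposition of a matrix containing NAs is expected to fail.

```r
# Simulated data; all names are hypothetical.
set.seed(1)
d <- data.frame(y = rnorm(100), x = rnorm(100))
d$x[1:10] <- NA

# na.pass keeps every row, including those with missing values
mf <- model.frame(y ~ x, data = d, na.action = na.pass)
X  <- model.matrix(attr(mf, "terms"), mf)

n_rows <- nrow(X)      # no rows were dropped
has_na <- anyNA(X)     # the NAs are now inside the design matrix

# Attempt the QR decomposition and capture the error message, if any
qr_msg <- tryCatch({ qr(X); "no error" },
                   error = function(e) conditionMessage(e))
```

If this reading is right, na.pass would only "work" when the variables in the model happen to contain no NAs, in which case it changes nothing.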

I’d assume that imputation is better practice for handling NAs. For my crime/ed analysis specifically (I’ve posted the data here: https://drive.google.com/open?id=1wRwLqCKNfpz5aHtyy5KfY07_RFqWsWv9), though, this is a bit more difficult, and it is something I have yet to do. I’ve been reading about it here: https://stefvanbuuren.name/fimd/sec-rastering.html.

In addition, there are instances where data is only offered every five years, or, as is the case with a presidential election, every four years. My “bandaid” approach for this kind of data pitfall is to stagger the four years, so that the election data counts for the two years preceding and the two years following the election (this is an assumption, but it seems preferable to NAs for three out of four years). 
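A minimal base-R sketch of this staggering, on made-up numbers (the `elections` values and the column name `dem_share` are hypothetical). Each year takes the value from the nearest election year. Because a year two years after one election is also two years before the next, a tie-break is needed; assigning ties to the earlier election is an assumption made for this sketch.

```r
# Made-up election results; not from the crime/ed dataset.
elections <- data.frame(year = c(2008, 2012, 2016),
                        dem_share = c(0.53, 0.51, 0.48))

# Yearly panel to be filled
panel <- data.frame(year = 2006:2018)

# Index of the nearest election for each year;
# which.min() returns the first minimum, so ties go to the earlier election
nearest <- sapply(panel$year,
                  function(y) which.min(abs(elections$year - y)))

panel$election_year <- elections$year[nearest]
panel$dem_share     <- elections$dem_share[nearest]
```

Keeping `election_year` in the panel makes it explicit which rows share a copied value rather than an independent measurement.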

Still, it seems that some weirdness might accompany this method. Looking at the educational attainment data in the dataset (averaged over a five-year period), there is an unexpectedly high correlation between year and the proportion of people in a place at each level of educational attainment (some high school, hs diploma, some college, bachelors, MA, etc.); these individual variables have anywhere from a -.5 to a .6 correlation with year.

Code for looking at correlations:
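```
library(dplyr)     # select(), %>%
library(corrplot)  # corrplot.mixed()
# Keep column 3 and columns 8-31, then drop every row that contains an NA
cor.total.years.city <- total.years.city.select %>% select(3, 8:31) %>% na.omit()
cor1 <- cor(cor.total.years.city)
corrplot.mixed(cor1, lower.col = "black", number.cex = .7)
```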

Perhaps I should put these variables into long format, but I’ve read that this sometimes exacerbates multicollinearity. (And it wouldn’t resolve the correlation strangeness.)
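For concreteness, this is what I mean by long format, sketched in base R on a made-up two-city table (the column names `some_hs`, `hs_diploma` and `bachelors` are hypothetical stand-ins for the attainment variables):

```r
# Made-up wide table: one column per attainment level
wide <- data.frame(city = c("A", "B"),
                   year = c(2010, 2010),
                   some_hs    = c(0.10, 0.15),
                   hs_diploma = c(0.30, 0.35),
                   bachelors  = c(0.20, 0.25))

edu_cols <- c("some_hs", "hs_diploma", "bachelors")

# Long table: one row per city-year-level combination.
# unlist() stacks the columns in order, matching rep(..., each = nrow(wide)).
long <- data.frame(
  city       = rep(wide$city, times = length(edu_cols)),
  year       = rep(wide$year, times = length(edu_cols)),
  level      = rep(edu_cols, each = nrow(wide)),
  proportion = unlist(wide[edu_cols], use.names = FALSE)
)
```

In this layout the attainment level becomes a single factor rather than several separate predictors.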

To summarize: 
1. If lmer does handle NAs well, how exactly is it doing that? If “na.pass” fails, is lmer just handling NAs the way any other program would?
2. Is imputation (done correctly) better than allowing mixed-effect functions to handle NAs?
3. Any specific resources on imputing longitudinal data?
4. For data offered every four years, is my method of staggering (and filling) the data sufficient? Is there another way I should be thinking about this in lme4? Is this the source of the funky correlations between educational attainment and year?
5. Should I be using long format here for variables like race (black, white, asian, latino) and educational attainment (some high school, hs diploma, some college, bachelors, MA/grad school)?

Thanks much!

James