[R-sig-ME] Use of mixed models when the causal relations are unclear?

Mon Feb 20 01:38:49 CET 2012

When I was looking at official statistics on level of education for
native vs. migrant populations in different areas of a city I'm
currently studying, it struck me that I could use mixed models to back
up an initial observation I had made. After conducting the analysis, I
was unsure whether this was a proper thing to do. Perhaps you can
give me your judgement on this?

Here are some rows from two tables of official statistics that gave me the idea:

native.education (educational level, measured in years)
            Area  -8    9   11   12   14   15+  NA   SUM
1       Gunnared 461 1772 1721 1427  557   532 104  6574
2     Lärjedalen 443 1568 1755 1587  802   830  84  7069
...
15        Styrsö 242  449  648  536  437   686  23  3021

migrant.education
            Area   -8    9   11   12   14  15+   NA  SUM
1       Gunnared 1474 1627 2166 1723  988 1097  817 9892
2     Lärjedalen 1839 1667 1947 1668  945  918 1008 9992
...
15        Styrsö    7   17   27   16   25   47   13  152

In the area "Styrsö", being a migrant was associated with a higher
level of education than that of the natives in the area. The same
applies to the area "Gunnared", but for the area "Lärjedalen" the
reverse seems to hold. I used lmer in lme4 to test my hypothesis.
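
As a rough check of the observation above, here is a minimal sketch that
computes the mean education in years per area from the three rows shown
in the tables. It assumes that "-8" can be scored as 8 years and "15+" as
15 years (my assumption, the tables only give the categories), and it
leaves out the NA column. The object names are made up for illustration.

```r
# Years assigned to each education category.
# Assumption: "-8" is scored as 8 and "15+" as 15.
years <- c(8, 9, 11, 12, 14, 15)
areas <- c("Gunnared", "Lärjedalen", "Styrsö")

# Counts copied from the rows shown in the tables above (NA column omitted).
native.counts <- matrix(c(461, 1772, 1721, 1427, 557, 532,
                          443, 1568, 1755, 1587, 802, 830,
                          242,  449,  648,  536, 437, 686),
                        nrow = 3, byrow = TRUE, dimnames = list(areas, NULL))
migrant.counts <- matrix(c(1474, 1627, 2166, 1723, 988, 1097,
                           1839, 1667, 1947, 1668, 945,  918,
                              7,   17,   27,   16,  25,   47),
                         nrow = 3, byrow = TRUE, dimnames = list(areas, NULL))

# Weighted mean of the year scores, one value per area.
mean.years <- function(counts) as.vector(counts %*% years) / rowSums(counts)

# Positive values: migrants have the higher mean; negative: natives do.
diff.years <- mean.years(migrant.counts) - mean.years(native.counts)
print(round(diff.years, 2))
```

Under that scoring the sign of the difference is positive for Gunnared
and Styrsö and negative for Lärjedalen, which is the pattern described
above.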

Based on the tables I created a data.frame which you can get here:

print(load(url("http://code.cjb.net/temp/dotplottest.RData")))
[1] "test.df"
> str(test.df)
'data.frame':	362319 obs. of  3 variables:
 $ area     : Factor w/ 21 levels "Gunnared","Lärjedalen",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ native   : Factor w/ 2 levels "yes","no": 1 1 1 1 1 1 1 1 1 1 ...
 $ education: Ord.factor w/ 6 levels "Folkskola"<"Grundskola"<..: 1 1 1 1 1 1 1 1 1 1 ...

(I translated the labels of the education factor into rough estimates
in years in the tables above, since the labels are only meaningful to
speakers of Swedish.)
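
For anyone who wants to rebuild such a data.frame from the count tables,
here is a minimal sketch of how the counts can be expanded into one row
per person. It uses only the Styrsö rows from the tables above, and it
uses the year labels as stand-ins for the Swedish factor labels (an
assumption for illustration; counts.df and long.df are hypothetical
names, not objects in the .RData file).

```r
# Year labels used as stand-ins for the Swedish education labels (assumption).
edu.levels <- c("-8", "9", "11", "12", "14", "15+")

# One row per (native, education) cell for Styrsö, NA column omitted.
counts.df <- data.frame(
  area      = "Styrsö",
  native    = rep(c("yes", "no"), each = length(edu.levels)),
  education = rep(edu.levels, times = 2),
  n         = c(242, 449, 648, 536, 437, 686,   # natives
                  7,  17,  27,  16,  25,  47),  # migrants
  stringsAsFactors = FALSE
)

# Repeat each cell's row n times to get one row per person.
long.df <- counts.df[rep(seq_len(nrow(counts.df)), times = counts.df$n),
                     c("area", "native", "education")]
rownames(long.df) <- NULL

# Same column types as in str(test.df).
long.df$area      <- factor(long.df$area)
long.df$native    <- factor(long.df$native, levels = c("yes", "no"))
long.df$education <- factor(long.df$education, levels = edu.levels,
                            ordered = TRUE)
str(long.df)
```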

I fitted the following model on the data:

library(lme4)
my.fit <- lmer(education ~ 1 + (native | area), data = test.df)

and I got a nice graph:

dotplot(ranef(my.fit, postVar = TRUE))
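
In case the calls above do not run as written on a more recent lme4,
here is a minimal sketch of the same kind of fit on simulated data. It
rests on several assumptions: the data are made up (sim.df is not the
official statistics), the ordered factor is converted to its integer
level codes before fitting because recent versions of lmer expect a
numeric response, and condVar is used because it is the newer name for
the postVar argument of ranef.

```r
library(lme4)     # lmer(), ranef()
library(lattice)  # dotplot()

set.seed(1)

# Simulated stand-in for test.df: 10 areas, 200 natives and 200 migrants each.
n.area <- 10
sim.df <- expand.grid(person = 1:200,
                      native = c("yes", "no"),
                      area   = paste0("area", 1:n.area))

area.int   <- rnorm(n.area, sd = 0.5)  # area deviations for natives
area.slope <- rnorm(n.area, sd = 0.7)  # area-specific migrant difference
idx    <- as.integer(sim.df$area)
latent <- 3.5 + area.int[idx] + area.slope[idx] * (sim.df$native == "no") +
  rnorm(nrow(sim.df))

# Six ordered education levels, as in test.df.
score <- pmin(6, pmax(1, round(latent)))
sim.df$education <- factor(score, levels = 1:6, ordered = TRUE)

# Treat the level codes 1..6 as a numeric score (a strong assumption).
sim.df$edu.score <- as.numeric(sim.df$education)

# Same model structure as above: random intercept and native effect by area.
sim.fit <- lmer(edu.score ~ 1 + (native | area), data = sim.df)

# Conditional modes with their conditional variances, then the dotplot.
sim.re <- ranef(sim.fit, condVar = TRUE)
dotplot(sim.re)
```

Treating the level codes as equally spaced numbers is a simplification;
an ordinal model would avoid it, but that is outside what I tried here.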

The question I have is this: if education is a factor in determining
which area a migrant moves into, then education is not dependent on
the area, and the model is not "true".

While the causal relations in this case are thus unclear, or mixed, is
it still reasonable to use the mixed model as I did to get proper
confidence intervals for the mere correlation/association between
area, nativeness and educational level?