[R-sig-ME] Use of mixed models when the causal relations are unclear?
Hans Ekbrand
hans at sociologi.cjb.net
Mon Feb 20 01:38:49 CET 2012
When I was looking at official statistics on level of education for
native vs migrant populations in different areas of a city I'm
currently studying, it struck me that I could use mixed models to back
up an initial observation I made. After conducting the analysis, I was
unsure whether this was a proper thing to do. Perhaps you can
give some judgement on this?
Here are some rows from two tables of official statistics that gave me the idea:
native.education (educational level, measured in years)
   Area          -8     9    11    12    14   15+    NA   SUM
1  Gunnared     461  1772  1721  1427   557   532   104  6574
2  Lärjedalen   443  1568  1755  1587   802   830    84  7069
...
15 Styrsö       242   449   648   536   437   686    23  3021

migrant.education
   Area          -8     9    11    12    14   15+    NA   SUM
1  Gunnared    1474  1627  2166  1723   988  1097   817  9892
2  Lärjedalen  1839  1667  1947  1668   945   918  1008  9992
...
15 Styrsö         7    17    27    16    25    47    13   152
In the area "Styrsö", migrants had a higher level of education than
the natives in the area. The same applies to the area "Gunnared", but
for the area "Lärjedalen" the reverse seems to hold. I used lmer in
lme4 to test my hypothesis.
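The observation can be checked roughly from the three rows shown above. This is only a sketch: coding the bracket "-8" as 8 years and "15+" as 15 years is my assumption for the illustration, and the NA and SUM columns are left out.

```r
# Years assigned to each education bracket. Coding "-8" as 8 and "15+" as 15
# is an assumption made only for this illustration.
years <- c(8, 9, 11, 12, 14, 15)
areas <- c("Gunnared", "Lärjedalen", "Styrsö")

# Counts copied from the rows shown in the tables (NA and SUM columns dropped).
native <- matrix(c(461, 1772, 1721, 1427, 557, 532,
                   443, 1568, 1755, 1587, 802, 830,
                   242,  449,  648,  536, 437, 686),
                 nrow = 3, byrow = TRUE, dimnames = list(areas, NULL))
migrant <- matrix(c(1474, 1627, 2166, 1723, 988, 1097,
                    1839, 1667, 1947, 1668, 945,  918,
                       7,   17,   27,   16,  25,   47),
                  nrow = 3, byrow = TRUE, dimnames = list(areas, NULL))

# Count-weighted mean years of education per area.
mean_years <- function(counts) drop(counts %*% years) / rowSums(counts)

# Positive values: migrants have more years on average than natives.
gap <- mean_years(migrant) - mean_years(native)
print(round(gap, 2))
```

Under this coding the gap is positive for Gunnared and Styrsö and negative for Lärjedalen, which matches the description above.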
Based on the tables I created a data.frame which you can get here:
> print(load(url("http://code.cjb.net/temp/dotplottest.RData")))
[1] "test.df"
> str(test.df)
'data.frame': 362319 obs. of 3 variables:
$ area : Factor w/ 21 levels "Gunnared","Lärjedalen",..: 1 1 1 1 1 1 1 1 1 1 ...
$ native : Factor w/ 2 levels "yes","no": 1 1 1 1 1 1 1 1 1 1 ...
$ education: Ord.factor w/ 6 levels "Folkskola"<"Grundskola"<..: 1 1 1 1 1 1 1 1 1 1 ...
(I translated the labels of the education factor into a rough estimate
in years in the tables above, since the labels are only meaningful for
speakers of the Swedish language)
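For those who want to see how a data.frame of this shape can be built from count tables, here is a minimal sketch using only the Styrsö rows shown above. The names `brackets`, `counts` and `long` are made up for the illustration, and the year brackets are used as level labels instead of the Swedish labels in test.df.

```r
# Bracket labels as used in the tables (not the Swedish labels in test.df).
brackets <- c("-8", "9", "11", "12", "14", "15+")

# Counts for Styrsö only, copied from the tables (NA column dropped):
# first the six native counts, then the six migrant counts.
counts <- data.frame(
  area      = "Styrsö",
  native    = rep(c("yes", "no"), each = 6),
  education = rep(brackets, times = 2),
  n         = c(242, 449, 648, 536, 437, 686,
                  7,  17,  27,  16,  25,  47)
)

# Repeat each row of the count table n times to get one row per person.
long <- counts[rep(seq_len(nrow(counts)), times = counts$n),
               c("area", "native", "education")]
rownames(long) <- NULL

# Same column types as in test.df: two factors and an ordered factor.
long$area      <- factor(long$area)
long$native    <- factor(long$native, levels = c("yes", "no"))
long$education <- factor(long$education, levels = brackets, ordered = TRUE)

str(long)
```

The result has one row per person, so the counts in the table become the number of repeated rows for each combination of area, nativeness and education.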
I fitted the following model on the data:
library(lme4)
my.fit <- lmer(education ~ 1 + (native | area), data = test.df)
and I got a nice graph:
library(lattice)  # provides the dotplot() generic
dotplot(ranef(my.fit, postVar = TRUE))
The question I have is this: if education influences which area a
migrant moves into, then education does not depend on the area, and
the model is not "true".
While the causal relations in this case thus are unclear, or mixed, is
it still reasonable to use the mixed model as I did to get proper
confidence intervals, for the mere correlation/association between
area, nativeness and educational level?