[R-sig-ME] Bernoulli glmm question.
Rolf Turner
r.turner at auckland.ac.nz
Thu Mar 13 02:55:38 CET 2014
I am trying to help a graduate student in linguistics analyse her data.
Very much a case of the blind leading the blind, but I gotta try!
Summary of the structure of the data:
A number of (Mandarin speaking) students are assessed on their
pronunciation of a suite of "test items" --- English language words.
(E.g. umbrella, helicopter, knife.) They are assessed phoneme by
phoneme in each word. The response, at least in the context of my
question, is whether they got the pronunciation right (y = 1) or wrong
(y = 0).
The phonemes are classified into 7 types:
* Initial consonant
* Medial consonant
* Final consonant
* Initial consonant cluster
* Medial consonant cluster
* Final consonant cluster
* Vowel
The students are classified by sex ("gender" to those wimps who are too
embarrassed to say the word "sex").
I thought to fit a Bernoulli model with "type" (of phoneme) and sex (of
the student) as predictors, with "student" being a random effect.
The syntax that I tried was:
fit <- lmer(y ~ sex + type + (1 | student), family = binomial, data = X)
where "X" is a data frame containing the relevant variables.
Main effects only, no interactions, so as to keep things simple --- at
least initially.
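For anyone wanting to try the same call without the real data, here is a minimal sketch on simulated data. It uses glmer() rather than lmer(), on the understanding that lme4 from version 1.0 onward expects glmer() whenever a family is given. The simulated data frame, the level names of "type", and all effect sizes are assumptions made up for illustration; nothing here is estimated from the real data.

```r
library(lme4)  # assumed to be installed; provides glmer()

set.seed(1)

# Hypothetical stand-in for X: 54 students, 7 phoneme types, 10 items per cell.
types <- c("FinalCluster", "InitialCluster", "MedialCluster",
           "InitialCons", "MedialCons", "FinalCons", "Vowel")
X <- expand.grid(student = factor(1:54), type = types, rep = 1:10)

# Final consonant clusters as the baseline level, as in the post.
X$type <- relevel(factor(X$type), ref = "FinalCluster")

# Sex is a student-level covariate: first 27 students "F", the rest "M".
sex_by_student <- rep(c("F", "M"), each = 27)
X$sex <- factor(sex_by_student[as.integer(X$student)])

# Made-up random intercepts and fixed effects on the logit scale.
b   <- rnorm(54, sd = sqrt(0.05))
eta <- 1.5 + b[as.integer(X$student)] +
  1.0 * (X$type == "Vowel") - 0.3 * (X$sex == "M")
X$y <- rbinom(nrow(X), size = 1, prob = plogis(eta))

# Bernoulli GLMM: main effects only, random intercept per student.
fit <- glmer(y ~ sex + type + (1 | student), family = binomial, data = X)
print(summary(fit)$coefficients)
```

The coefficient table has one row for the intercept, one for sex, and six for the non-baseline phoneme types.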
First impressions from the fit: Girls do significantly better than
boys. Vowels are significantly easier than final consonant clusters
(which form the baseline), while initial and medial clusters are
significantly harder for the kids to pronounce than are the final
clusters. Single consonants (initial, medial, and final) do not differ
significantly from the baseline in their difficulty level.
The bit about vowels being easier conforms to the graduate student's
expectations and is kind of obvious from a rough inspection of the data.
There are 50 "test items" (words). In the data set that I am initially
looking at there are 54 students. There are a total of 10314 observations.
(I am just looking at the oldest group of students to start with. There
are 6 other groups and eventually I will put all 7 groups together and
investigate an age (or "level") effect as well.)
Would anyone be kind enough to comment on my efforts so far? Please try
not to be too rude! :-) Am I on the right track? Am I overlooking any
glaring traps for young players? Have I got the syntax of my call to
lmer() correct?
One thing that I am nervous about:
If I fit the "trivial model"
fit0 <- lmer(y ~ 1 + (1 | student), family = binomial, data = X)
the resulting coefficients are just the estimates (BLUPs?) of the
"random intercepts", is it not so? If I calculate the variance of these
coefficients:
var(coef(fit0)$student[,1])
I get 0.0226. I thought that this value would be "pretty similar to"
(though not exactly the same as) the estimated random effect variance.
But the latter is 0.0502 --- which seems to me to be quite different.
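The two quantities being compared can be pulled out of a fitted model as sketched below. The sketch runs on simulated data in which the true random-intercept variance is known, so that the same comparison can be tried away from the real data. The intercept of 2, the variance of 0.05, and the 191 observations per student (roughly 10314 / 54) are assumptions chosen for illustration, and glmer() is used in place of lmer() with a family argument.

```r
library(lme4)  # assumed to be installed

set.seed(2)

n_student <- 54
n_obs     <- 191   # roughly 10314 / 54 observations per student

# Hypothetical true random intercepts with variance 0.05 on the logit scale.
b <- rnorm(n_student, sd = sqrt(0.05))
X <- data.frame(student = factor(rep(seq_len(n_student), each = n_obs)))
X$y <- rbinom(nrow(X), size = 1, prob = plogis(2 + b[as.integer(X$student)]))

# The "trivial model": intercept only, random intercept per student.
fit0 <- glmer(y ~ 1 + (1 | student), family = binomial, data = X)

# Estimated random-effect variance reported by the model.
est_var <- as.numeric(VarCorr(fit0)$student[1, 1])

# Sample variance of the per-student coefficients, as in the post.
coef_var <- var(coef(fit0)$student[, 1])

# coef() is the fixed intercept plus each student's predicted random effect,
# so its variance matches the variance of ranef().
ranef_var <- var(ranef(fit0)$student[, 1])

print(c(estimated = est_var, from_coef = coef_var, from_ranef = ranef_var))
```

Changing n_obs or the assumed intercept shows how the gap between the two numbers responds to the amount of information per student.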
A 95% confidence interval for sigma^2 on the basis of my "var(coef ...)"
calculation, assuming that (n-1)*s^2/sigma^2 ~ chi-squared_{n-1},
is [0.0160, 0.0345] (to 4 decimal places) so the estimated random effect
variance from fit0 is "significantly different" from my naive estimate.
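The interval can be reproduced in base R along the following lines. The variable names are illustrative, and the inputs are the rounded figures quoted above, so the last decimal place may differ slightly from the interval as stated.

```r
n  <- 54       # number of students
s2 <- 0.0226   # var(coef(fit0)$student[, 1]), rounded as quoted above

# Invert (n - 1) * s^2 / sigma^2 ~ chi-squared with n - 1 degrees of freedom:
# the upper chi-squared quantile gives the lower limit, and vice versa.
ci <- (n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)
names(ci) <- c("lower", "upper")
print(round(ci, 4))

# Does the model's estimated random-effect variance fall inside the interval?
est_var <- 0.0502
print(est_var >= ci["lower"] && est_var <= ci["upper"])
```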
My thinking must be out to lunch here. Can someone put me back on the
rails? (My humblest apologies for the mixed metaphors. :-) )
Thanks for any words of wisdom.
cheers,
Rolf Turner