[R-sig-ME] Random effects of logistic regression: bias towards the mean?

Tom Lentz t.o.lentz at uu.nl
Tue Mar 25 17:39:33 CET 2014


Dear Tibor,

Thanks for your interest; of course I'm happy to elaborate.

First, note that a 'hit' in this case does not mean anything in itself; I
am just simulating data. In my own psycholinguistic eye-tracking
research it would be a participant looking at a target region of the
screen; it could also be a participant responding correctly. Here it is
simply a binomial dependent variable with a given probability of being 1
(true, 'hit', or 'correct', but that is just an interpretation of
otherwise meaningless data I generated with a randomiser).

So, I made datasets with different hit probabilities, all based on a
dataset without a dependent variable but with a column p (participant),
a column i (item), and a column cond (condition). This dataset mimics a
nicely designed experiment in which 40 p(articipants) each receive 40
i(tems) that can occur in four conditions (A, B, C, D); each p receives
10 i's in condition A, 10 in B, and so on, and the items are rotated
through the p's.
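The construction of that dataset is not shown above; a minimal sketch
under an assumed Latin-square-style rotation (the column names p, i,
cond are taken from the description, the exact rotation scheme is a
guess) could look like this:

```r
# Hypothetical reconstruction of the design described above:
# 40 participants x 40 items, conditions A-D rotated so that each
# participant sees 10 items per condition (one possible rotation scheme).
dataset <- expand.grid(p = factor(1:40), i = factor(1:40))
dataset$cond <- factor(LETTERS[(as.integer(dataset$p) + as.integer(dataset$i)) %% 4 + 1])
```

Any rotation that balances conditions within participants and within
items would do equally well for the simulation.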

In each simulated analysis, I added a dependent variable outcome with
some probability of a hit (i.e. the value 1). This probability varies (I
have walked through different values in the simulations), but below you
can see one simulation in which the probability is 0.1 for the whole
outcome column (so without any relation to the predictors):
> rbinom(1600, 1, 0.1) -> dataset$outcome

> summary(dataset)
        p              i        cond       outcome
  1      :  40   1      :  40   A:400   Min.   :0.00000
  2      :  40   2      :  40   B:400   1st Qu.:0.00000
  3      :  40   3      :  40   C:400   Median :0.00000
  4      :  40   4      :  40   D:400   Mean   :0.09375
  5      :  40   5      :  40           3rd Qu.:0.00000
  6      :  40   6      :  40           Max.   :1.00000
  (Other):1360   (Other):1360

Indeed, the mean value is close to 0.10: 9.375% of the rows have outcome
= 1 and the rest have outcome = 0.

The model is made as follows:
> glmer(outcome ~ cond + (1|i) + (1|p), data=dataset, family=binomial) -> m

with the output:
> m
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) ['glmerMod']
  Family: binomial ( logit )
Formula: outcome ~ cond + (1 | i) + (1 | p)
    Data: dataset
       AIC       BIC    logLik  deviance  df.resid
1005.1915 1037.4581 -496.5958  993.1915      1594
Random effects:
  Groups Name        Std.Dev.
  i      (Intercept) 0.1881
  p      (Intercept) 0.2185
Number of obs: 1600, groups: i, 40; p, 40
Fixed Effects:
(Intercept)        condB        condC        condD
     -2.3476      -0.1291       0.1723       0.1172
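
As a quick sanity check (not in the original post), the fitted intercept
can be back-transformed to the probability scale with plogis(), the
inverse logit:

```r
# Back-transform the condition-A intercept from the model output above.
p_hat <- plogis(-2.3476)   # inverse logit of the fixed intercept
p_hat                       # roughly 0.087
```

This is somewhat below the overall observed rate of 0.09375, although
the intercept refers to condition A only, so the comparison is rough.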


Now, the random intercepts for p average to:
> colMeans(ranef(m)$p)
(Intercept)
0.003661209

... and for i:
> colMeans(ranef(m)$i)
(Intercept)
  0.00271321


I would expect these random intercepts to average zero. Of course, one
analysis is not enough to conclude that something is wrong, but after
iterating the process just described 2000 times for each hit probability
of the dependent variable outcome, I still found that the average random
effect is larger the lower the probability: the average random effects
only cross zero when the hit probability crosses chance (0.5). The
problem is that (since the linear predictor = fixed effects + random
effects) random effect estimates that do not average zero pull the fixed
effect estimates away from the average.
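
One small illustration of why symmetry need not carry over across the
link function (this is only an illustration of the link's nonlinearity,
not a full explanation of the pattern above): probabilities spaced
symmetrically around 0.1 map to logits that are not symmetric around
qlogis(0.1), whereas around 0.5 they are.

```r
# Symmetric hit rates around 0.10 on the probability scale ...
rates <- c(0.05, 0.10, 0.15)
# ... are asymmetric on the logit (log-odds) scale:
logits <- qlogis(rates)
mean(logits) - qlogis(0.10)   # nonzero; the same construction around 0.5 gives 0
```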

I hope this elaboration helps, but do not hesitate to ask more details!

Kind regards,

Tom

On 25-03-14 09:06, Tibor Kiss wrote:
> Dear Tom,
>
> it is hard to answer your question without the actual output of your model (not of your simulation), i.e. print(summary(model), corr = F). It is also not clear to me what you are actually measuring, i.e. what the "hit" should be. Perhaps you can elaborate.
>
>
> With kind regards
>
> Tibor
>
>
> Am 25.03.2014 um 08:54 schrieb Tom Lentz:
>
>> Dear all,
>>
>> The following question might be due to my poor understanding of logistic regression, in which case I would be very grateful for an explanation or a pointer to reading material.
>>
>> With my current understanding, I think that logistic regression as typically done with lmer and family="binomial" (which actually calls glmer; calling lmer directly for this is now deprecated) behaves in an unexpected way: it does not make the random effects average near zero, but moves them towards chance, i.e. towards positive values if the hit probability is below 0.5 and towards negative values if it is above 0.5. At first I thought this was shrinkage, but it does not happen if the data are aggregated and a normal linear mixed model is fitted to percentages; I think that approach is ugly and should give worse or at best equal results, not better ones, because the percentages cannot be normally distributed, especially if they are far from chance.
>>
>> I discovered this issue in the analysis of eye-tracking data, in which the chance of looking at the target was around 0.25, but the fixed effects in my model were lower than the mean, and the random effects for participant and item were not centred around zero (hence, participants tended to do better than the fixed-effect average, and items generally tended to be recognised better than the fixed effects predicted). The result is that the fixed effect estimates are not at the average values, but lower.
>>
>> As my data set might have had a poorly understood conspiracy in it, I simulated data. Every simulated data set had 40 participants and 40 items (easy if you make it up!), but no effect of fixed effects; there was a condition (A, B, C or D) but the outcome was not influenced by this condition. The dependent variable was drawn with rbinom(1600, 1, probability), where probability was varied: 0.1, 0.15, 0.2 up till 0.9.
>>
>> For each probability I ran 2000 analyses with this formula:
>> lmer(outcome ~ cond + (1|i) + (1|p), data=dataset, family = "binomial")
>> and looked at the random effects for item and participants. Indeed, the lower the hit rate (the probability of the dependent variable outcome being TRUE or 1), the higher the average random effect, with a zero average for the random effects only at a 0.5 probability (or 0 logit). A plot can be found at <http://www.hum.uu.nl/medewerkers/t.o.lentz/plotRanefsR3.pdf>.
>>
>> The fixed effect of cond should not be significant, as the data were made up without regard to it. Indeed, at an alpha of 0.05 a spurious significant effect was found in only 4.2% of the simulations. So the analyses are not causing errors for hypothesis testing, but the estimates of the random effects are off. Is there a good explanation, or is this unexpected behaviour?
>>
>> Version information: I detected the problem a while ago, back in R 2, and it still happens in R 3.0.3 with lme4 version 1.1-5.
>>
>> Thanks in advance for your help!
>>
>> Kind regards,
>>
>> Tom
>>
>> TO Lentz PhD
>> Postdoctoral Researcher,
>> Parsing and Metrical Structure: Where Phonology Meets Processing
>>
>> Utrecht Institute of Linguistics OTS
>> Utrecht University
>> Trans 10
>> 3512 JK Utrecht
>> Netherlands
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
