[R-sig-ME] Perfectly correlated random effects (when they shouldn't be)
svm
steven.v.miller at gmail.com
Wed Jul 15 17:42:40 CEST 2015
Appreciate all the input. My preference would be for the minimal models for
concern of computation and overparameterization, but my field is one where
the reviewer is always right and junk models are encouraged. I do
appreciate the insight about the small number of regions. I think I had
been mistaken in my belief that the number of countries per region would
compensate even if the number of regions themselves were small.
On Wed, Jul 15, 2015 at 7:32 AM, Reinhold Kliegl <reinhold.kliegl at gmail.com>
wrote:
> Correction: ... a counterexample to the claim that maximal models are
> always conservative (not: anti-conservative).
>
>
> On Wed, Jul 15, 2015 at 1:25 PM, Reinhold Kliegl <
> reinhold.kliegl at gmail.com> wrote:
>
>> Actually, I overlooked that the region-related variance component of x2
>> can also be removed from the GLMM without loss of goodness of fit. Thus, I
>> end up with an intercept-only GLMM as the most parsimonious model for this
>> data set. Here is the anova() output:
>>
>> > anova(zcpM3, zcpM2, zcpM1, M1)
>> Data: data
>> Models:
>> zcpM3: y ~ x1 + x2 + (1 | country) + (1 | country:wave) + (1 | region)
>> zcpM2: y ~ x1 + x2 + (1 | country) + (1 | country:wave) + (1 + x2 ||
>> region)
>> zcpM1: y ~ x1 + x2 + (1 | country) + (1 | country:wave) + (1 + x1 + x2 ||
>> region)
>> M1: y ~ x1 + x2 + (1 | country) + (1 | country:wave) + (1 + x1 + x2 |
>> region)
>> Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
>> zcpM3 6 258193 258255 -129091 258181 (only
>> intercept, parsimonious GLMM)
>> zcpM2 7 258195 258267 -129091 258181 0.1882 1 0.6644
>> zcpM1 8 258197 258279 -129091 258181 0.0000 1 1.0000
>> M1 11 258201 258314 -129090 258179 1.8885 3 0.5959
>> (overparameterized max GLMM)
>>
>> There is another interesting aspect to this reduction of model
>> complexity. The parsimonious GLMM yields the SMALLEST z-value for the
>> significant fixed effect of x1. Thus, this is a counterexample to the claim
>> that maximal models are always anti-conservative. Whether this is the case
>> or not may depend on whether the maximal model is overparameterized or not.
>> Here are the corresponding fixed-effect estimates for x1 and x2:
>>
>> Fixed effects for x1 across GLMMs:
>> Estimate Std. Error z value Pr(>|z|)
>> zcpM3: x1 0.51004 0.23859 2.138 0.0325 * (only
>> intercept, parsimonious GLMM)
>> zcpM2: x1 0.515822 0.167005 3.089 0.00201 **
>> zcpM1: x1 0.516168 0.149704 3.448 0.000565 ***
>> M1: x1 0.48913 0.19972 2.449 0.0143 *
>> (overparameterized max GLMM)
>>
>> Fixed effects for x2 across GLMMs:
>> zcpM3: x2 -0.02743 0.15881 -0.173 0.8629 (only
>> intercept, parsimonious GLMM)
>> zcpM2: x2 -0.008715 0.151586 -0.057 0.95416
>> zcpM1: x2 -0.008701 0.155332 -0.056 0.955331
>> M1: x2 0.04422 0.17376 0.254 0.7991
>> (overparameterized max GLMM)
>>
>> On Wed, Jul 15, 2015 at 10:57 AM, Reinhold Kliegl <
>> reinhold.kliegl at gmail.com> wrote:
>>
>>> It looks like your model is overparameterized. Specifically, although
>>> you have a very large number of observations (N=212570), the data do not
>>> support the six model parameters you try to estimate for the random-effects
>>> term of region. You can see this quickly with the rePCA() function. There
>>> was also a convergence warning for me, but perhaps this was due to the fact
>>> that I worked with the full set (as provided in the dropbox file), whereas
>>> you used a subset.
>>>
>>> As a first step, I forced the correlation parameters to zero with the
>>> ||-syntax (zcpM1 GLMM below). A LRT suggests that dropping the three
>>> correlation parameters does not decrease the goodness of fit; Chi-sq(3) =
>>> 1.9, p=.5959. This could very well be a consequence of the small number of
>>> regions, as Jake pointed out; and even 18 regions may not be enough for a
>>> reliable estimation of three correlation parameters.
>>>
>>> Moreover, the rePCA() suggests that even the three remaining model
>>> parameters for the random-effects term of region are quite marginally
>>> supported by the data. In the zero-correlation parameter GLMM, the variance
>>> component for x1 was estimated to be suspiciously small. Therefore, in a
>>> second step, I removed the region-related x1 variance component . Again,
>>> according to a LRT there was no significant loss in goodness of fit;
>>> Chi-sq(1) = 0.
>>>
>>> The R script is inserted below. Please note that taking a correlation
>>> parameter or a variance component out of a model does not mean that it is
>>> zero, but that the data do not support a model that assumes that it is
>>> different from zero. zcpM2 (below) looks like an adequate parsimonious
>>> GLMM for these data.
>>>
>>>
>>> ###
>>>
>>> # Note: The original post used a subset of the data, but it is not clear
>>> how they were selected. I used all the data.
>>>
>>> library(lme4)
>>> library(RePsychLing)
>>> data <- read.csv("correlated-random-effects.csv", header=TRUE)
>>> str(data)
>>> data$country <- factor(data$country)
>>> data$region <- factor(data$region)
>>>
>>> # no convergence problem
>>> print(summary(M1 <- glmer(y ~ x1 + x2 + (1 | country) + (1 |
>>> country:wave) +
>>> (1 + x1 + x2 | region), data=data,
>>> family=binomial(link="logit"))))
>>>
>>> summary(rePCA(M1))
>>>
>>> # convergence problem
>>> print(summary(zcpM1 <- glmer(y ~ x1 + x2 + (1 | country) + (1 |
>>> country:wave) +
>>> (1 + x1 + x2 || region), data=data,
>>> family=binomial(link="logit"))))
>>> summary(rePCA(zcpM1))
>>>
>>> # ok
>>> print(summary(zcpM2 <- glmer(y ~ x1 + x2 + (1 | country) + (1 |
>>> country:wave) +
>>> (1 + x2 || region), data=data,
>>> family=binomial(link="logit"))))
>>> summary(rePCA(zcpM2))
>>>
>>> anova(zcpM2, zcpM1, M1)
>>>
>>> ####
>>>
>>>
>>>
>>>
>>> On Wed, Jul 15, 2015 at 3:45 AM, svm <steven.v.miller at gmail.com> wrote:
>>>
>>>> I considered that. I disaggregated the region random effect from 6 to 18
>>>> (the latter of which approximates the World Bank's region
>>>> classification).
>>>> I'm still encountering the same curious issue.
>>>>
>>>> Random effects:
>>>> Groups Name Variance Std.Dev. Corr
>>>> country:wave (Intercept) 0.1530052 0.39116
>>>> country (Intercept) 0.3735876 0.61122
>>>> wbregion (Intercept) 0.0137822 0.11740
>>>> x1 0.0009384 0.03063 -1.00
>>>> x2 0.0767387 0.27702 -1.00 1.00
>>>> Number of obs: 212570, groups: country:wave, 143; country, 82;
>>>> wbregion, 18
>>>>
>>>> For what it's worth: the model estimates fine. The results are
>>>> intuitive
>>>> and theoretically consistent. They also don't change if I were to remove
>>>> that region random effect. I'd like to keep the region random effect
>>>> (with
>>>> varying slopes) in the model. I'm struggling with what I should think
>>>> about
>>>> the perfect correlations.
>>>>
>>>> On Tue, Jul 14, 2015 at 9:07 PM, Jake Westfall <jake987722 at hotmail.com>
>>>> wrote:
>>>>
>>>> > Hi Steve,
>>>> >
>>>> >
>>>> > I think the issue is that estimating 3 variances and 3 covariances for
>>>> > regions is quite ambitious given that there are only 6 regions. I
>>>> think
>>>> > it's not surprising that the model has a hard time getting good
>>>> estimates
>>>> > of those parameters.
>>>> >
>>>> >
>>>> > Jake
>>>> >
>>>> > > Date: Tue, 14 Jul 2015 20:53:01 -0400
>>>> > > From: steven.v.miller at gmail.com
>>>> > > To: r-sig-mixed-models at r-project.org
>>>> > > Subject: [R-sig-ME] Perfectly correlated random effects (when they
>>>> > shouldn't be)
>>>> >
>>>> > >
>>>> > > Hi all,
>>>> > >
>>>> > > I'm a long-time reader and wanted to raise a question I've seen
>>>> asked
>>>> > here
>>>> > > before about correlated random effects. Past answers I have
>>>> encountered
>>>> > on
>>>> > > this listserv explain that perfectly correlated random effects
>>>> suggest
>>>> > > model overfitting and variances of random effects that are
>>>> effectively
>>>> > zero
>>>> > > and can be omitted for a simpler model. In my case, I don't think
>>>> that's
>>>> > > what is happening here, though I could well be fitting a poor model
>>>> in
>>>> > > glmer.
>>>> > >
>>>> > > I'll describe the nature of the data first. I'm modeling
>>>> individual-level
>>>> > > survey data for countries across multiple waves and am estimating
>>>> the
>>>> > > region of the globe as a random effect as well. I have three random
>>>> > effects
>>>> > > (country, country-wave, and region). In the region random effect, I
>>>> am
>>>> > > allowing country-wave-level predictors to have varying slopes. My
>>>> inquiry
>>>> > > is whether some country-wave-level contextual indicator can have an
>>>> > overall
>>>> > > effect (as a fixed effect), the effect of which can vary by region.
>>>> In
>>>> > > other words: is the effect of some country-level indicator (e.g.
>>>> > > unemployment) in a given year different for countries in Western
>>>> Europe
>>>> > > than for countries in Africa even if, on average, there is a
>>>> positive or
>>>> > > negative association at the individual-level? These
>>>> country-wave-level
>>>> > > predictors that I allow to vary by region are the ones reporting
>>>> perfect
>>>> > > correlation and I'm unsure how to interpret that (or if I'm
>>>> estimating
>>>> > the
>>>> > > model correctly).
>>>> > >
>>>> > > I should also add that I have individual-level predictors as well as
>>>> > > country-wave-level predictors, though it's the latter that concerns
>>>> me.
>>>> > > Further, every non-binary indicator in the model is standardized by
>>>> two
>>>> > > standard deviations.
>>>> > >
>>>> > > For those interested, I have a reproducible (if rather large)
>>>> example
>>>> > > below. Dropbox link to the data is here:
>>>> > >
>>>> >
>>>> https://www.dropbox.com/s/t29jfwm98tsdr71/correlated-random-effects.csv?dl=0
>>>> > >
>>>> > > In this reproducible example, y is the outcome variable and x1 and
>>>> x2 are
>>>> > > two country-wave-level predictors I allow to vary by region. Both
>>>> x1 and
>>>> > x2
>>>> > > are interval-level predictors that I standardized to have a mean of
>>>> zero
>>>> > > and a standard deviation of .5 (per Gelman's (2008) recommendation).
>>>> > >
>>>> > > I estimate the following model.
>>>> > >
>>>> > > summary(M1 <- glmer(y ~ x1 + x2 + (1 | country) + (1 |
>>>> country:wave) +
>>>> > (1 +
>>>> > > x1 + x2 | region), data=subset(Data),
>>>> family=binomial(link="logit")))
>>>> > >
>>>> > > The results are theoretically intuitive. I think they make sense.
>>>> > However,
>>>> > > I get a report of perfect correlation for the varying slopes of the
>>>> > region
>>>> > > random effect.
>>>> > >
>>>> > > Random effects:
>>>> > > Groups Name Variance Std.Dev. Corr
>>>> > > country:wave (Intercept) 0.15915 0.3989
>>>> > > country (Intercept) 0.32945 0.5740
>>>> > > region (Intercept) 0.01646 0.1283
>>>> > > x1 0.02366 0.1538 1.00
>>>> > > x2 0.13994 0.3741 -1.00 -1.00
>>>> > > Number of obs: 212570, groups: country:wave, 143; country, 82;
>>>> region, 6
>>>> > >
>>>> > > What should I make of this and am I estimating this model wrong?
>>>> For what
>>>> > > it's worth, the dotplot of the region random effect (with
>>>> conditional
>>>> > > variance) makes sense and is theoretically intuitive, given my
>>>> data. (
>>>> > > http://i.imgur.com/mrnaJ77.png)
>>>> > >
>>>> > > Any help would be greatly appreciated.
>>>> > >
>>>> > > Best regards,
>>>> > > Steve
>>>> > >
>>>> > > [[alternative HTML version deleted]]
>>>> > >
>>>> > > _______________________________________________
>>>> > > R-sig-mixed-models at r-project.org mailing list
>>>> > > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Steven V. Miller
>>>> Assistant Professor
>>>> Department of Political Science
>>>> Clemson University
>>>> http://svmiller.com
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> R-sig-mixed-models at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>>
>>>
>>>
>>
>
[[alternative HTML version deleted]]
More information about the R-sig-mixed-models
mailing list