[R-sig-ME] In simple terms, how is the estimated variance of higher-level effects calculated?

Tue Jul 17 20:55:40 CEST 2012

On Tue, Jul 17, 2012 at 12:13 PM, Jeremy Koster <helixed2 at yahoo.com> wrote:
> Thanks for the feedback, David.  Owing to my lack of expertise, I confess that I didn't completely follow everything, but your email inspired me to explore the varying intercepts with different exponential families, focusing particularly on the gaussian.
>
> In short, my students were looking at the estimated varying intercepts for each higher-level group (or the "BLUP's", as some people seem to call them) -- the intercepts that one can see by entering:

As Alan James once said, "these values are just like the BLUPs - Best
Linear Unbiased Predictors - except that they aren't linear and they
aren't unbiased and there is no clear sense in which they are "best",
but other than that ..."

The way I think of the model there are two vector-valued random
variables, the response, Y, and the random-effects vector, B.  The
model defines the conditional distribution of Y, given B, through a
linear predictor expression, a link function and a distribution.  It
also defines the unconditional distribution of B as multivariate
Gaussian with mean 0 and a parameterized variance-covariance matrix,
Sigma.  Given these two distributions we can evaluate the joint
distribution and the conditional distribution of B, given Y.  In
practice it is tricky to evaluate the conditional distribution of B
given the observed value of Y because you need to integrate the
unscaled conditional density.  However, you can find the conditional
mode, which is the value of B that maximizes the unscaled conditional
density, without needing to integrate.  It is these conditional modes
that are returned by ranef().

> ranef (fm1)
>
> Their suggestion was that, if one calculates the variance of that vector of estimated intercepts, then one should get the number that gets reported by lme4 for the higher-level variance (for example, 0.93537 in my earlier email).
>
> We did this for several two-level models, and although the variance we calculate from the group-specific intercepts is always in the same ballpark as the lme4 output, it's never on target.

Generally you will find that the variance of the values returned by
ranef is less than the estimated variance because of shrinkage.

> So I've been scouring all the resources I have at hand: Gelman and Hill, Ben Bolker's book, Snijders and Bosker, etc.
>
> Most of these seem to say: "And this is the variance estimate and how to interpret it" . . . without going into non-technical details about how exactly it's calculated.  That's where we're stuck.  I know that the intercepts are calculated via partial pooling and a shrinkage factor, but it's not clear how that relates to the estimated variance.

>
>
>
> ----- Original Message -----
> From: David Duffy <David.Duffy at qimr.edu.au>
> To: Jeremy Koster <helixed2 at yahoo.com>
> Cc: r-sig-mixed-models at r-project.org
> Sent: Monday, July 16, 2012 6:43 PM
> Subject: Re: [R-sig-ME] In simple terms,how is the estimated variance of higher-level effects calculated?
>
> On Mon, 16 Jul 2012, Jeremy Koster wrote:
>
>> I'm teaching some grad students about mixed-effects modeling. To their credit, they're paying close attention and asking good questions.
>>
>> Today, we were talking about variance components in a basic two-level binomial glmer with no fixed effects.
> [...]
>> So if one were to describe in simple terms how lme4 generates a number for the estimated variance of the random effects, what might be said?
>
> I think conceptualizing it as a latent variable model helps.  Since the latent variables are unobserved, we make inferences about their distribution based upon the distribution of the manifest variables and our assumptions about the nature of the latent variable distribution.
>
> Different assumed latent variable distributions eg beta, normal, mixtures - and different link functions eg logit, probit, log, identity - will change not only your variance estimates, but your interpretation.
>
> One useful exercise might be to simulate binary data from a threshold model, and demonstrate how it is that the variances of the (known) latent variables are estimated (in a probit-normal model), and how the tetrachoric correlation, Pearson correlation and odds ratio for a 2x2 table vary by marginal probabilities and association strength.
>
> You might also compare different models for this "classic"
> boric acid teratogenicity dataset:
>
> http://genepi.qimr.edu.au/staff/davidD/Sib-pair/Documents/Using_Sib-pair/Scripts/boricex.in
>
> A final example might be to look at the commonly used approach of fitting a LMM to binary data coded as 1's and 0's (going back to Cochrane 1943), and whether results are deceptive or not.  In analysis of Genome Wide Association Scan data for a binary phenotype Y, we test the (fixed) effect of each measured polymorphism X (usually scored as 0,1,2) against Y, but we need to adjust for confounding due to unobserved relatedness of individuals in the study. The latter is estimated as an NxN empirical kinship matrix (the average pairwise correlation over M polymorphisms between N study participants, with M=2000000 to 5000000, and N = 1000 to 100000).  When Y is continuous, a LMM is a very attractive approach...
>
> -- | David Duffy (MBBS PhD)                                         ,-_|\
> | email: davidD at qimr.edu.au  ph: INT+61+7+3362-0217 fax: -0101  /     *
> | Epidemiology Unit, Queensland Institute of Medical Research   \_,-._/
> | 300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A v
>
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models