[R-sig-ME] Repeated measures in lmer().
Douglas Bates
bates at stat.wisc.edu
Thu Apr 2 22:31:54 CEST 2009
On Wed, Apr 1, 2009 at 8:47 PM, Rolf Turner <r.turner at auckland.ac.nz> wrote:
> Yesterday I posted a query with subject line ``Puzzled by simple example.''
> The puzzlement was really due to a clumsy stupidity on my part, about
> which the less said the better. :-)
> The example that I was (initially) puzzled by was really a preliminary
> to getting to the real issue with which I am trying to come to grips.
> This is the issue of fitting repeated measures models using lmer().
> The model that was fitted in my yesterday's posting was
> lmer(y ~ age + (1 | child),data=ht.dat)
> This assumes, as I far as I understand things (which is admittedly not
> very far) a constant diagonal covariance matrix for the within-child
> random effect. This is a pretty unrealistic assumption. Doug Bates
> kindly informed me, many months ago, that to get a general positive
> definite covariance matrix I should use the syntax
> lmer(y ~ age + (age | child),data=ht.dat) # (*)
> but that the model that lmer would be attempting to fit thereby would be
> singular.
> It has taken me this many months to get even a start at clearing away the
> cobwebs from my thinking so that I could understand this, but I now think
> I'm getting to square zero at least.
>
> The model being fitted by (*) above is
>
> Y = X beta + Z_1 b_1 + Z_2 b_2 + E
>
> but really the ``Z b'' terms (child effect + age-by-child interaction)
> can be combined into a single term so the model becomes just
>
> Y = X beta + Z b + E
>
> which is written entry by entry as
>
> y_ij = beta_i + B_ij + E_ij
>
> where the E_ij are all i.i.d. N(0,sigma^2_E) and the B_ij are independent
> of the E_ij, likewise Gaussian with mean 0, but each B_j = (B_1j,...,B_4j)'
> has covariance matrix Sigma (4-x-4, positive definite, but with no other
> specified structure).
>
> The reason the model is singular is that you can ``blend'' as much or
> as little as you choose of sigma^2_E into the diagonal of Sigma and have
> the same model. The miraculous thing is that despite the singularity,
> lmer() gives the (or ``a'') correct answer, at least in the simulated
> data experiments that I've tried. E.g.:
>
> #
> # Script scr.02
> #
>
> library(MASS)
> library(lme4)
>
> # Generate the data. No attempt is made to make these
> # data realistic.
> NCHILD <- 20
> set.seed(42)
> Beta <- c(40,50,60,70)
> Sigma <- 2*matrix(c(1.00,0.50,0.25,0.10,
> 0.50,1.00,0.50,0.25,
> 0.25,0.50,1.00,0.50,
> 0.10,0.25,0.50,1.00),ncol=4)
> MB <- mvrnorm(NCHILD,rep(0,4),Sigma)
> B <- as.vector(t(MB))
> E <- rnorm(4*NCHILD,0,0)
> Y <- Beta + B + E
> ht.dat.2 <-
> data.frame(y=Y,age=factor(rep(4:7,NCHILD)),child=factor(rep(1:NCHILD,each=4)))
>
> # Fit with lmer():
> f2 <- lmer(y ~ 0 + age + (0+age|child),data=ht.dat.2,REML=TRUE)
> V <- VarCorr(f2)
> M <- V[[1]]
> S <- attr(V,"sc")
> covb.lmer <- M + diag(rep(S^2,4))
> attributes(covb.lmer) <- attributes(covb.lmer)["dim"]
> covb.chk <- var(MB)
>
> # Clean up:
> rm(NCHILD,Beta,Sigma,MB,B,E,Y,V,M,S)
>
> You see that if you take the residual variance component and add it to
> the diagonal of the ``child'' covariance matrix (``Sigma-hat'') you get
> (very close to) exactly the right answer. For NCHILD = 20 as in the
> foregoing example we get:
>
>> range(covb.lmer-covb.chk)
> [1] -1.288370e-04 1.467542e-05
>
> Without even a warning. If we jack NCHILD up to 500 we get a warning
> about false convergence (well, I did yesterday, but when I sourced
> scr.02 just now, I didn't --- ???) but the agreement is even better:
>
>> range(covb.lmer-covb.chk)
> [1] -6.019627e-06 6.337365e-06
>
> My question is: Is this behaviour reliable? Can I depend on this sort
> of work-around for fitting a repeated measures model where there is
> no replication, i.e. one observation per (age,child) combination, whence
> no room for an E_ij term in the model? Suppose the repeated measures
> model is complicated by other factors. E.g. instead of all girls we
> have both sexes and put a sex factor into the model. Or the children
> are nested within localities or countries. Can I still fit the model(s)
> in this way?
>
> ``Ideally'' I'd like to be able to fit the model
>
> y_ij = beta_i + B_ij
>
> i.e. suppress the residual error E_ij term (or lump it in with the
> child random effect). This is presumably not possible as lmer() is
> currently constituted. Would it/could it be possible? I.e. could the
> code be adjusted to accommodate this desideratum? Would it be a
> coding nightmare to do this? It would seem to me to a fairly
> *desirable* feature, since repeated measures data with no
> replications are not at all uncommon. But of course desirable and
> feasible are two very different things.
>
> On another tack: Have I (at last) got hold of the right end of the
> stick in respect of such models? Or is my understanding still flawed?
> Am I going at fitting repeated measures models in the right way?
> Is there another --- better --- way to do this using lmer()?
>
> I have tried a similar simulation experiment with two replicate observations
> for each child. (Height measured twice on each occasion; the E_ijk, k=1,2,
> now representing measurement error.) The answers appeared to make sense
> although in these circumstances one cannot do an exact check since the
> estimate of ``Sigma'' produced by lmer() will not be the same as var(MB)
> *because of* the measurement error.
>
> Comments? Suggestions? Enlightenment? Feedback? Such from the horse's
> mouth (i.e. Doug Bates) would be particularly welcome.
I'm glad to see the word "mouth" in that sentence. Occasionally I
have been described with reference to another part of a horse's
anatomy. :-)
I appreciate your difficulty in phrasing a model such as this for
lmer. In fact, I don't think one could reliably fit the model that
you want to fit using lmer.
If I may be so bold, I would suggest that the fault is more in the
model formulation than in the software.
You are taking what may be termed a "classical" approach to repeated
measures data, specifically longitudinal data. You require that the
data be expressible as a subjects by occasions table and essentially
extracting means and covariances from the columns of that table.
Sometime we refer to that organization is the wide format, as opposed
to the long format where each row corresponds to a single observation
with covariates of subject and occasion. The key is that you are
regarding age as an unordered categorical variable.
The wide format view works fine until it doesn't. When I first
started looking at longitudinal data I saw discussions of what to do
about missing data or what to do if the nominal ages are 10, 11 and 12
years but you actually see the subject at ages 10.10, 10.92 and 12.03
years. If you think of putting the data into the long format these
questions don't come up. If you are missing an observation then you
delete that row. If different subjects are recorded at different ages
then so be it. Record the ages at which you actually saw them.
Then examine the data, in my case I would use a lattice plot such as
the enclosed, to see what a typical within-subject trajectory is. The
plot is generated with the enclosed scripts and some models are fit.
The data are balanced with respect to the number of occasions at which
the subject's height is measured but the actual ages are somewhat
unbalanced. One can go ahead and fit a model to these data using age
as a covariate, even though there are 14, not 9, unique ages.
The model you want to fit would have 9 distinct random effects for
each subject, which would be a saturated model. I would claim that
you almost never need a saturated model like that. Here I have
allowed for fixed and random effects to a third-order polynomial but
even that is probably stretching the point. Looking at the data plot
doesn't convince me that fitting cubic terms has practical
significance, even if it is statistically significant.
As I write this I hear George Box's voice extolling the virtues of
parsimony in a model. Whenever you have a covariate like time I think
it is a waste to convert it to a categorical covariate, even if
everyone was measured at exactly the same times or ages.
The bottom line is that you can't fit a saturated linear mixed model
with lmer reliably because lmer will always throw in one variance
parameter in addition to those generated by the random-effects terms.
So my advice is "Don't do that." :-)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Oxboys.pdf
Type: application/pdf
Size: 50046 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20090402/15ba152f/attachment.pdf>
More information about the R-sig-mixed-models
mailing list