[R-sig-ME] Teaching Mixed Effects

Greg Snow Greg.Snow at imail.org
Mon Jan 26 18:15:15 CET 2009


> -----Original Message-----
> From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-
> models-bounces at r-project.org] On Behalf Of Ben Bolker
> Sent: Thursday, January 22, 2009 4:42 PM
> To: R Mixed Models
> Subject: Re: [R-sig-ME] Teaching Mixed Effects
> 
> 
>   Wow.
>   Two very small points:
> 
> >
> > I feel that the likelihood ratio is a perfectly reasonable way of
> > comparing two model fits where one is a special case of the other.
> In
> > fact, if the models have been fit by maximum likelihood, the
> > likelihood ratio would, I think, be the first candidate for a test
> > statistic.  The problem with likelihood ratio tests is not the
> > likelihood ratio, per se -- it is converting the likelihood ratio to
> a
> > p-value.  You need to be able to evaluate the distribution of the
> > likelihood ratio under the null hypothesis.  The chi-square
> > approximation to the distribution is exactly that - an approximation
> -
> > and its validity depends on not testing at the boundary and on having
> > a large sample, in some sense of the sample size.  If I were really
> > interested in evaluating a p-value for the likelihood ratio I would
> > probably try a parametric bootstrap to get a reference distribution.
> >
> 
>   Even if we are not p-value obsessed, we would still presumably
> like to be able to make some kind of (even informal) inference from
> the difference in fits, perhaps at the level of "model 1 fits
> (much better|a little better|about the same|a little worse|much worse)
> than model 2", or "the range of plausible estimates for this
> parameter is (tiny|small|moderate|large|absurdly large)". To
> do that we need some kind of metric (if we have not yet fled
> to Bayesian or quasi-Bayesian methods) for the range of
> the deviance under some kind of null case -- for example,
> where should we set cutoff levels on the likelihood profile
> to determine confidence regions for parameters? Parametric
> bootstrap makes sense, although it is a little scary to think
> e.g. of doing a power analysis for such a procedure ...
> 

I took this as a bit of a challenge: to look at the idea of finding the preferred model in a mixed-effects setting (I just stuck with the simple either-or choice between two models, but this could be expanded to your five levels above, or to other information).  For this I am using the sample dataset, sleepstudy, from the lme4 package.

Let's start with the example from the help page, which basically tests whether there is a correlation between the random effects for the intercept and the random effects for the slope:
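
For reference, that comparison on the real data is just the two fits from the help page plus an anova call; a minimal sketch (output not shown):

library(lme4)

# full model: correlated random intercept and slope
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# reduced model: independent random intercept and slope (no correlation)
fm2 <- lmer(Reaction ~ Days + (1 | Subject) + (0 + Days | Subject), sleepstudy)
anova(fm2, fm1)   # likelihood ratio comparison for the correlation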

The following function will simulate data of the same structure with given values for the betas and variances:

sleepsimfun1 <- function(b0, b1, Vb0, Vb1, V01, Serror) {
	# simulate one dataset with the structure of sleepstudy:
	# 18 subjects x 10 days, random intercept/slope with covariance V01
	mydata <- expand.grid( Days=0:9, Subject=1:18 )
	RE <- MASS::mvrnorm(18, c(0,0), matrix( c(Vb0,V01,V01,Vb1), 2) )
	mydata$Reaction <- with(mydata, 
		(b0+RE[Subject,1]) + (b1+RE[Subject,2])*Days + rnorm(180, 0, Serror)
			)

	# fit the correlated and uncorrelated random-effects models and compare
	fit1 <- lmer(Reaction ~ Days + (Days|Subject), mydata)
	fit2 <- lmer(Reaction ~ Days + (1|Subject) + (0+Days|Subject), mydata)
	anova(fit2, fit1)
}
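
The parameter values plugged in below (251, 10.5, 627, 36, and 653) are, up to rounding, the estimates from fitting the zero-correlation model (fm2 above) to the real data.  A minimal sketch of pulling the exact values, assuming the fixef(), VarCorr(), and sigma() accessors in current lme4:

fixef(fm2)                   # fixed intercept and slope (rounded to 251 and 10.5 below)
as.data.frame(VarCorr(fm2))  # variance components (rounded to 627, 36, and 653 below)
sigma(fm2)                   # residual SD, i.e. the Serror argument (sqrt(653) below)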

We can use this to simulate the data and analysis under the null condition (no correlation):

pb <- winProgressBar(max=1000)    # Windows-only progress bar (purely cosmetic)
pbinc <- function() setWinProgressBar(pb, getWinProgressBar(pb)+1)

setWinProgressBar(pb,0)
out1 <- replicate(1000, {pbinc(); 
		sleepsimfun1(251, 10.5, 627, 36, 0, sqrt(653))}, simplify=FALSE )

# p-value for the correlation test from each simulated anova table
p1 <- sapply(out1, function(x) x[2, "Pr(>Chisq)"])

This gives the results:

> hist( p1 )
> mean( p1 <= 0.05 )
[1] 0.06
> prop.test( sum(p1 <= 0.05), 1000 )

        1-sample proportions test with continuity correction

data:  sum(p1 <= 0.05) out of 1000, null probability 0.5 
X-squared = 772.641, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5 
95 percent confidence interval:
 0.04645524 0.07702574 
sample estimates:
   p 
0.06 

The histogram is fairly uniform and the size of the test is pretty close to the designated alpha of 0.05, so the regular use of anova is probably appropriate here.

Let's look at a different model comparison: we want to know whether the number of days has any effect.  A reduced model without Days at all simultaneously tests the fixed slope, the random effect on the slope, and the correlation between the random effects (3 fewer parameters).

This function simulates the data and compares the 2 models:

sleepsimfun2 <- function(b0, b1, Vb0, Vb1, V01, Serror) {
	# same simulated structure as sleepsimfun1, but the reduced model
	# drops Days entirely (fixed slope, random slope, and correlation)
	mydata <- expand.grid( Days=0:9, Subject=1:18 )
	RE <- MASS::mvrnorm(18, c(0,0), matrix( c(Vb0,V01,V01,Vb1), 2) )
	mydata$Reaction <- with(mydata, 
		(b0+RE[Subject,1]) + (b1+RE[Subject,2])*Days + rnorm(180, 0, Serror)
			)

	fit1 <- lmer(Reaction ~ Days + (Days|Subject), mydata)
	fit2 <- lmer(Reaction ~ 1 + (1|Subject), mydata)
	anova(fit2, fit1)
}

We can test it under the null hypothesis with the following code:

setWinProgressBar(pb,0)
out2 <- replicate(1000, {pbinc(); 
		sleepsimfun2(298.5, 0, 1278.3, 0, 0, sqrt(1959))}, simplify=FALSE )

# p-value for the 3-parameter comparison from each simulated anova table
p2 <- sapply(out2, function(x) x[2, "Pr(>Chisq)"])

The results are:

> hist( p2 )
> mean( p2 <= 0.05 )
[1] 0.033
> prop.test( sum(p2 <= 0.05), 1000 )

        1-sample proportions test with continuity correction

data:  sum(p2 <= 0.05) out of 1000, null probability 0.5 
X-squared = 870.489, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5 
95 percent confidence interval:
 0.02317469 0.04655855 
sample estimates:
    p 
0.033

The histogram shows p-values pushed towards 1 rather than uniform, and the true size of the test is smaller than the nominal alpha = 0.05 (the conservativeness one expects when variance parameters are tested on the boundary of their space).  This comparison is a good candidate for basing the likelihood ratio comparison on simulation (a parametric bootstrap).

> ll2 <- sapply(out2, function(x) x[2, "Chisq"])
> hist(ll2)
> quantile( ll2, 0.95 )
    95% 
6.95113 
> fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
> fm3 <- lmer(Reaction ~ 1 + (1|Subject), sleepstudy)
> (ts <- anova(fm1, fm3)[2, "Chisq"])
[1] 158.5811
> mean( ll2 >= ts )
[1] 0

So twice the log-likelihood difference for the real data is larger than every one of the 2 * log-likelihood differences from the simulation (I should probably redo this with the exact estimates instead of my quick rounded values, but the results should be similar); this is roughly equivalent to a p-value close to 0.
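
The exact values could be pulled straight from the reduced model fitted to the real data instead of being typed in by hand; a minimal sketch, again assuming current lme4 accessors:

fixef(fm3)                   # exact intercept (rounded to 298.5 above)
as.data.frame(VarCorr(fm3))  # exact variances (rounded to 1278.3 and 1959 above)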

But we may be interested in whether this approach really behaves properly, and in what our power would be for this type of study with different parameter values.

This function simulates the above procedure: it generates 1000 values under the null hypothesis to get a reference distribution, then simulates a dataset and tests it by comparing to that reference distribution.  (This version creates the reference distribution from the hypothesized parameter values; an alternative would be to generate the data first, fit the reduced model, and then simulate the reference distribution based on that fit, as in the sketch after the function below.  Which approach is better is another discussion.)

sleepsimfun3 <- function(b0, b1, Vb0, Vb1, V01, Serror) {
	# reference distribution: 1000 LR statistics simulated under the null
	# (using the hypothesized parameter values, not a fitted model)
	setWinProgressBar(pb,0)
	out.null <- replicate(1000, {pbinc(); 
		sleepsimfun2(b0, 0, Vb0, 0, 0, Serror)[2, "Chisq"]} )

	# simulate one dataset under the specified (possibly non-null) parameters
	mydata <- expand.grid( Days=0:9, Subject=1:18 )
	RE <- MASS::mvrnorm(18, c(0,0), matrix( c(Vb0,V01,V01,Vb1), 2) )
	mydata$Reaction <- with(mydata, 
		(b0+RE[Subject,1]) + (b1+RE[Subject,2])*Days + rnorm(180, 0, Serror)
			)

	fit1 <- lmer(Reaction ~ Days + (Days|Subject), mydata)
	fit2 <- lmer(Reaction ~ 1 + (1|Subject), mydata)
	ts <- anova(fit2, fit1)[2, "Chisq"]

	# simulated p-value: proportion of null statistics at least as large
	mean( out.null >= ts, na.rm=TRUE )
}
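
For completeness, the alternative mentioned above (building the reference distribution from the reduced model fitted to the observed data rather than from hypothesized values) might look roughly like this; the simulate() and refit() methods are from recent versions of lme4, and refdist_pvalue is just an illustrative name:

refdist_pvalue <- function(full, reduced, nsim = 1000) {
	obs <- anova(reduced, full)[2, "Chisq"]    # observed LR statistic
	newy <- simulate(reduced, nsim = nsim)     # responses simulated from the reduced fit
	ref <- sapply(newy, function(y)
		anova(refit(reduced, y), refit(full, y))[2, "Chisq"])
	mean(ref >= obs, na.rm = TRUE)             # simulated p-value
}

# e.g. refdist_pvalue(fm1, fm3) for the Days comparison on the sleepstudy data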

We can run sleepsimfun3 under the null hypothesis (with arbitrarily chosen values for the non-zero parameters) to check its behavior.  (A note for anyone running the code themselves: I started the next two batches of code running before leaving work Friday afternoon; the first batch finished late Saturday morning and the second sometime after I went to bed on Saturday.)

pb2 <- winProgressBar('Progress Bar 2',max=1000)
pbinc2 <- function() setWinProgressBar(pb2, getWinProgressBar(pb2)+1)

# check under the null

setWinProgressBar(pb2, 0)
out3 <- replicate(1000, {pbinc2(); 
		sleepsimfun3(100, 0, 1000, 0, 0, 45)} )

The output:

> hist(out3)
> mean(out3 <= 0.05)
[1] 0.056
> prop.test( sum(out3<=0.05), 1000)

        1-sample proportions test with continuity correction

data:  sum(out3 <= 0.05) out of 1000, null probability 0.5 
X-squared = 786.769, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5 
95 percent confidence interval:
 0.04293596 0.07258036 
sample estimates:
    p 
0.056 

The histogram looks close enough to uniform (I know I could do formal tests or additional plots, but I am just going for the eyeball test here) and the size of the test is close enough to the target alpha = 0.05.

Now let's try a power analysis for a given slope (arbitrarily chosen in this case):

# find power for b1=5, Vb1=0, V01=0

setWinProgressBar(pb2, 0)
out4 <- replicate(1000, {pbinc2(); 
		sleepsimfun3(100, 5, 1000, 0, 0, 45)} )

with results:

> hist(out4)
> mean(out4 <= 0.05)
[1] 0.972
> prop.test( sum(out4<=0.05), 1000)

        1-sample proportions test with continuity correction

data:  sum(out4 <= 0.05) out of 1000, null probability 0.5 
X-squared = 889.249, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5 
95 percent confidence interval:
 0.9592453 0.9809686 
sample estimates:
    p 
0.972

So we have over 95% power to detect a true slope of 5 (conditional on the values chosen for the other parameters).

So while a power study may take a bit of time (parallelization will help; a sketch follows below), it seems feasible (whether or not it is scary still depends on the person).
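
For what it's worth, a rough sketch of that parallelization using the parallel package in current R (the cluster size and seed are arbitrary, and the progress-bar calls inside sleepsimfun3 would need to be dropped or made no-ops for this to run on the workers):

library(parallel)

cl <- makeCluster(4)                                # 4 local workers (arbitrary)
clusterEvalQ(cl, library(lme4))                     # load lme4 on each worker
clusterExport(cl, c("sleepsimfun2", "sleepsimfun3"))
clusterSetRNGStream(cl, 2009)                       # reproducible parallel RNG streams
out4p <- parSapply(cl, 1:1000,
		function(i) sleepsimfun3(100, 5, 1000, 0, 0, 45))
stopCluster(cl)
mean(out4p <= 0.05)                                 # estimated power, as above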

And to think that a few years ago, when I was trying to get the flawed Pentium processor in my computer replaced, I got the runaround, with each person trying to explain that I did not need it replaced because only large corporations and research scientists doing tens of thousands of floating point operations would be affected by the flaw.


-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111



