[R-sig-ME] binomial fixed-effect p-values by simulation

Ken Beath ken at kjbeath.com.au
Sat Aug 23 02:45:47 CEST 2008


Daniel,

This is almost parametric bootstrapping. Rather than looking at the
distribution of the parameter estimates, though, a parametric bootstrap
looks at the distribution of the test statistic (the z value), which
should be an improvement. I'm not certain of the cleanest way to
extract it, but summary(fixed.sim)@coefs will return it.
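For example, a minimal sketch of what I mean, reusing the objects from
your code below (and the summary(.)@coefs slot that your own code
already uses; z.observed and z.simulated are just names I've made up):

z.observed <- summary(fixed)@coefs["genderM", "z value"]
z.simulated <- numeric(1000)
# inside your simulation loop, after fitting fixed.sim:
#   z.simulated[i] <- summary(fixed.sim)@coefs["genderM", "z value"]
# then compare the observed statistic with the simulated distribution:
z.p.value <- mean(abs(z.simulated) >= abs(z.observed))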

Unless the sample size is small I wouldn't be particularly worried
about the distribution of the fixed-effect estimates, and in your
example I expect there won't be much difference.

Another option is to use a sandwich (robust) variance estimator with
ordinary logistic regression, clustering on subject.
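A sketch of that, using the sandwich and lmtest packages (assuming a
version of sandwich that provides vcovCL for cluster-robust standard
errors), and reusing the glm fit 'old' and the data 'observed' from
your code:

library(sandwich)  # robust covariance estimators, including vcovCL
library(lmtest)    # coeftest() for Wald tests with a supplied vcov
# Wald tests for the ordinary logistic regression, with standard
# errors clustered on subject:
coeftest(old, vcov. = vcovCL(old, cluster = observed$subject))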

Ken


On 23/08/2008, at 8:22 AM, Daniel Ezra Johnson wrote:

> Consider the common situation of modeling multiple responses from
> multiple subjects, where we want to know whether a fixed-effect
> between-subjects variable -- for example, gender -- has a significant
> effect.
>
> In my field (quantitative sociolinguistics) the response is usually
> binomial, and for some 30 years the usual procedure has been to carry
> out logistic regression, setting the subject variable aside, and
> assessing fixed effect significance with a likelihood-ratio test. When
> between-subject variation is high, this produces exceptionally
> anti-conservative p-values (see old.p.value below).
>
> I am writing an article and some wrapper functions designed to
> introduce the use of mixed models to this audience. With subject
> included as a random effect, the p-values for fixed between-subject
> effects are much more conservative, even using the likelihood-ratio
> test (see lrt.p.value below). In other words, this is an improvement.
>
> But as has been discussed many times on this list, likelihood-ratio
> tests are still anti-conservative for fixed effect significance. I
> believe the extent of anti-conservatism depends on the number of
> observations relative to the number of groups, the balance of the
> data, and probably other factors too.
>
> I have seen the MCMC method discussed, but I think this is not yet
> available for binomial data with lmer(), is that correct? In addition,
> that method, which operates in the parameter space, seems to be
> testing a slightly different null hypothesis than the usual
> frequentist one, though I may be misunderstanding this.
>
> Would it be appropriate for me to carry out a simulation of the type
> given below, to assess fixed effect significance?
>
> The code starts by generating a dataset which happens to have a
> potentially significant gender effect, according to the LRT. (In fact
> the data is randomly generated and has no "real" gender effect, but
> assume it was real data and we did not know that.)
>
> A null model (no fixed effect) and an alternative model (fixed gender
> effect) are fit to the data by lmer(), using a random effect for
> subject in both cases.
>
> It then takes the two parameters from the null model -- intercept and
> subject s.d. -- and randomly generates new data. Each simulated
> dataset is fit with the alternative-style model, and the proportion of
> simulations in which the fixed-effect magnitude equals or exceeds the
> observed one is the p-value.
>
> Does this procedure make sense? It would generalize to linear models
> easily. Making it work for unbalanced data, or with other fixed
> effects in the model, would not be that hard either, I think (see the
> sketch below).
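> For instance, here is a rough, untested sketch of that generalization
> (simulate.null is just a name I made up): it draws new responses from
> the fitted null model while keeping the observed design, so unbalanced
> data would come along for free.
>
> simulate.null <- function(null.model, data) {
>     # new random intercepts for each subject, from the fitted s.d.
>     sd.subj <- attr(VarCorr(null.model)$subject, "stddev")
>     re.new <- rnorm(nlevels(factor(data$subject)), 0, sd.subj)
>     # linear predictor: null fixed effects plus new random effects
>     # (for a null model with other fixed effects, replace ~ 1 with
>     # that model's fixed-effect formula)
>     eta <- model.matrix(~ 1, data) %*% fixef(null.model) +
>            re.new[as.integer(factor(data$subject))]
>     data$response <- rbinom(nrow(data), 1, plogis(eta))
>     data
> }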
>
> To my mind, this is a direct implementation of the frequentist p-value
> concept via simulation. Am I missing something important?
>
> Of course, especially with binomial data this simulation is quite slow
> (compared, say, to the unimplemented MCMC method). The code below, with
> 1000 simulations of a 2000-observation dataset, took 15 minutes to run
> on a MacBook Pro. (The result was a p-value of 0.027, compared to
> 0.015 from the LRT, 0.009 from the Wald statistic... and 1.1e-56 from
> the fixed-effects model!)
>
> It's important to me to be able to give p-values that are
> accurate/conservative and that I understand. So if anyone can give me
> guidance on whether this simulation procedure makes sense, it would be
> much appreciated.
>
> And if there's a faster way to do something like this for binomial
> data, please let me know.
>
> Thanks,
> Daniel
>
> library(lme4)
> library(boot)
> set.seed(10)
>
> observed <- data.frame(subject = rep(letters[1:20], each=100),
>     gender = rep(c("M","F"), each=1000),
>     response = rbinom(2000, 1, rep(plogis(rnorm(20, 0, 3)), each=100)))
>
> old <- glm(response ~ gender, binomial, observed)
> old.p.value <- pchisq(old$null.deviance - old$deviance, df=1, lower.tail=FALSE)
> # 1.083e-56
>
> null <- lmer(response ~ (1|subject), observed, family=binomial)
> fixed <- lmer(response ~ gender + (1|subject), observed, family=binomial)
>
> wald.p.value <-  summary(fixed)@coefs["genderM","Pr(>|z|)"] # 0.00894
> lrt.p.value <- anova(null,fixed)[,"Pr(>Chisq)"][2] # 0.0152
>
> null.intercept <- fixef(null) # -0.0914
> null.subject.sd <- attr(VarCorr(null)$subject,"stddev") # 2.276
> gender.observed <- fixef(fixed)[2] # -2.319
> gender.simulated <- numeric(1000)
>
> for (i in 1:1000) {
>     # include the null-model intercept in the linear predictor, as
>     # described above
>     simulated <- data.frame(subject = rep(letters[1:20], each=100),
>         gender = rep(c("M","F"), each=1000),
>         response = rbinom(2000, 1,
>             rep(plogis(null.intercept + rnorm(20, 0, null.subject.sd)), each=100)))
>     fixed.sim <- lmer(response ~ gender + (1|subject), simulated, family=binomial)
>     gender.simulated[i] <- fixef(fixed.sim)[2]
> }
>
> # minutes later...
>
> simulation.p.value <- mean(abs(gender.simulated) >= abs(gender.observed)) # 0.027
>