[R-sig-ME] Why do extremely similar parametric bootstraps on lmer() objects yield (not only marginally) different results?

Ben Bolker bbolker at gmail.com
Sat May 3 20:14:30 CEST 2014


On 14-05-03 07:59 AM, Christian Brauner wrote:
> Hello all,
> 

> warning ahead: This will be quite a long post. But have faith, it is
>  necessary.

  I have some questions about the larger context here.  I agree that
it's important to understand the behaviour of the code and to ferret
out possible bugs (I will say that handling all of the permutations of
NA handling in the predict and simulate code is a big headache ...).
However, I'm wondering about a few things:

 * is the difference between p=0.004 and p=0.006 _practically_
important?
 * for this example, it seems that (as you comment) almost any approach
will give reasonable answers -- a simple LRT (because the number of levels
of the minimum-size grouping variable is >=40), or Kenward-Roger (because
this is an LMM).  Parametric bootstrap feels like a waste of computation
time.
 * I can't do it off the top of my head, but for a balanced/fully crossed
random effect, one might even be able to compute the denominator df
for the F distribution exactly.
 * is there a reason you're not using refit() in method 2?  It should
be faster (a rough sketch follows this list).
 * does PBmodcomp() work with the latest version of lme4, with the
simulate/NA fix?  Or do we or the pbkrtest maintainers still have work
to do?
 * I'm a little bit worried by method 2 -- I can imagine (although I
haven't tested this) that ysim may be a different length from, or may not
match, mod@frame, which will have been NA-processed.  However, I think it
should work OK if the default na.action is 'na.omit' (I'd prefer that
people _never_ use @ in their code ...)
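
  For what it's worth, a rough, untested sketch of what one iteration of
method 2 could look like using refit() and extractor functions, rather
than repeated lmer() calls and @-slot access (names as in your code;
treat it as an illustration, not a drop-in replacement):

library(lme4)
library(pbkrtest)

## Kenward-Roger comparison, as an alternative to the bootstrap:
## KRmodcomp(largeModel = mod8, smallModel = mod_min)

## one parametric-bootstrap iteration via refit(), reusing the fitted models
pb_once <- function(mod, modnull) {
    ysim    <- simulate(modnull)[[1]]      # one simulated response vector
    nullsim <- refit(modnull, newresp = ysim)
    altsim  <- refit(mod,     newresp = ysim)
    anova(nullsim, altsim)$Chisq[2]        # likelihood ratio test statistic
}

## e.g. lrt_sim <- replicate(10000, pb_once(mod8, mod_min))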


> I run parametric bootstraps in order to compare two models. I
> bootstrap the likelihood ratio between the models. For anybody
> acquainted with this, the code should be self-explanatory. Model
> selection is more or less done and these are exercises I like to
> perform with real data sets to get a feel for how different
> software packages treat the same models. Unfortunately, I am not in a
> position to provide the real data. A few words about the models
> though:

> Both models are fit on a balanced data set which includes around 483 missing
> values. There are slightly more than 11196 observations (within 70 subjects).
> Every subject saw every item only once. The dependent variable is reading
> time; sentence_type and matching are crossed two-level factors. Context and
> subject are treated as random effects. Context has 40 levels and (well, you
> guessed it) subject has 70 levels. (See this related post:
> http://stats.stackexchange.com/questions/94302/lmer-parametric-bootstrap-testing-for-fixed-effects):
> 
> Model formulae:
> ===============
>         
> # I only show the model formulae here because this piece stays fixed with
> # every call to lmer(). The options change however, depending on the parametric  
> # bootstrap procedure used. Any change can be seen in the code:
> mod_min := log(value) ~ 1 + (1 + satz_typ | vp) + (1 + satz_typ | kontext)
> mod8    := log(value) ~ passung + (1 + satz_typ | vp) + (1 + satz_typ | kontext)
> 
> I set up several different ways to perform the parametric bootstrap. I will
> describe each approach, explain what I changed every time, and show the
> results. The ultimate question will be: why the different results -- and
> especially, why in one of the bootstrap approaches?

  Making my own summary:

1. pbkrtest, na.action attribute removed. p=0.003
2. lmer refitting: p=0.0064
3. fitted on complete cases: p=0.0066
4. using na.exclude and refit: p = 0.0043
5. using complete cases and refit: p = 0.004

So I guess, to boil it down, the difference is in the way that refit()
handles missing values / accesses the model frame.  If I were going
to pursue this further, I would probably try to strip it down and look
at the results of a _single_ refit with the random number seed set to
be identical in every case ...
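
  Something along these lines, for example (untested, and assuming 'mod'
and 'modnull' are the fitted alternative and null models from any one of
the methods):

library(lme4)
set.seed(101)                    # same seed for every variant being compared
ysim <- simulate(modnull)[[1]]   # one simulated response under the null
## does the simulated response match the model frames in length?
c(length(ysim), nrow(model.frame(mod)), nrow(model.frame(modnull)))
anova(refit(modnull, newresp = ysim),
      refit(mod,     newresp = ysim))$Chisq[2]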


> 
> Method 1 (pbkrtest).
> ====================
> 
> 1.1. lmer() model calls:
> ------------------------
> 
> mod_min <- lmer(formula = log(value)
>                 ~ 1
>                 + (1 + satz_typ | vp)
>                 + (1 + satz_typ | kontext),
>                 data = wort12_lmm2_melt,
>                 REML = FALSE)
> 
> mod8 <- lmer(formula = log(value)
>              ~ passung
>              + (1 + satz_typ | vp)
>              + (1 + satz_typ | kontext),
>              data = wort12_lmm2_melt,
>              REML = FALSE)
> 
> 1.2. Parametric Bootstrap:
> --------------------------
> 
> # Remove the na.action attribute, otherwise PBmodcomp() will complain, as it
> # calls the simulate() function directly without na.action = na.exclude.
> # Relevant: see Ben Bolker's comments on this:
> # https://github.com/lme4/lme4/issues/189 -- dirty fix:
> mod8_nona    <- mod8
> mod_min_nona <- mod_min
> 
> attr(x     = mod8_nona@frame,
>      which = "na.action") <- NULL # remove na.action attribute from the model
> 
> attr(x     = mod_min_nona@frame,
>      which = "na.action") <- NULL # remove na.action attribute from the model
> 
> library(pbkrtest)
> 
> pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
>                                 smallModel = mod_min_nona,
>                                 nsim       = 10000,
>                                 cl         = cl)
> 
> 
> Take home message PBmodcomp(): p.value = 0.003
> 
> 1.3 Notable:
> ------------
> The na.action attribute was removed from both models before the simulate() call.
> 
> 
> Method 2.
> =========
> 
> 2.1 lmer() model calls:
> -----------------------
> 
> identical to Method 1 (pbkrtest).
> 
> 2.2 Parametric Bootstrap:
> -------------------------
> 
> mod <- mod8
> modnull <- mod_min
> # save the observed likelihood ratio test statistic
> lrt_obs <- anova(modnull,
>                  mod)$Chisq[2]
> 
> n_sim <- 10000
> 
> dattemp <- mod@frame
> 
> ptime <- system.time({
>     lrt_sim <- foreach (i = 1:n_sim,
>                         .combine  = "c",
>                         .packages = "lme4") %dopar% {
> 
>         ysim       <- unlist(x = simulate(object = modnull))
> 
>         modnullsim <- lmer(formula = ysim
>                            ~ 1
>                            + (1 + satz_typ | vp)
>                            + (1 + satz_typ | kontext),
>                            data = dattemp,
>                            REML = FALSE) # Fit the null model.
> 
>         modaltsim  <- lmer(formula = ysim
>                            ~ passung
>                            + (1 + satz_typ | vp)
>                            + (1 + satz_typ | kontext),
>                            data = dattemp,
>                            REML = FALSE) # Fit the alternative model.
> 
>         anova(modnullsim,
>               modaltsim)$Chisq[2] # Save the likelihood ratio test statistic.
> 
>     }
> })[3]
> 
> (sum(lrt_sim >= lrt_obs) + 1)/(n_sim + 1) = 0.00639936 # p.value
> 
> Take home message Method 2: p.value = 0.00639936
> 
> 2.3 Notable:
> ------------
> The na.action attribute has not been altered!
> 
> 
> Method 3.
> =========
> 3.1 lmer() model calls:
> -----------------------
> 
> # differences explained in 3.3
> mod_min <- lmer(formula = log(value)
>                 ~ 1
>                 + (1 + satz_typ | vp)
>                 + (1 + satz_typ | kontext),
>                 data = wort12_lmm2_melt_nona,
>                 REML = FALSE)
> 
> mod8 <- lmer(formula = log(value)
>              ~ passung
>              + (1 + satz_typ | vp)
>              + (1 + satz_typ | kontext),
>              data = wort12_lmm2_melt_nona,
>              REML = FALSE)
> 
> 3.2 Parametric Bootstrap:
> -------------------------
> 
> identical to Method 2.
> 
> Take home message Method 3: p.value = 0.00659934
> 
> 
> 3.3 Notable:
> ------------
> 
> The models were fit on a version of the original data set that was reduced to
> complete cases only, i.e. on a data set altered by the following code:
> 
> wort12_lmm2_melt_nona <-
>     wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]
> 
> 
> Method 4.
> =========
> 4.1 lmer() model calls:
> -----------------------
> 
> # differences explained in 4.3
> mod_min_naexclude <- lmer(formula = log(value)
>                           ~ 1
>                           + (1 + satz_typ | vp)
>                           + (1 + satz_typ | kontext),
>                           data      = wort12_lmm2_melt,
>                           REML      = FALSE,
>                           na.action = na.exclude)
> 
> mod8_naexclude <- lmer(formula = log(value)
>                        ~ passung
>                        + (1 + satz_typ | vp)
>                        + (1 + satz_typ | kontext),
>                        data      = wort12_lmm2_melt,
>                        REML      = FALSE,
>                        na.action = na.exclude)
> 
> 
> 4.2 Parametric Bootstrap:
> -------------------------
> mod <- mod8_naexclude
> modnull <- mod_min_naexclude
> 
> lrt_obs <- anova(modnull,
>                  mod)$Chisq[2] # Save the observed likelihood ratio test statistic.
> 
> n_sim <- 10000
> 
> ptime <- system.time({
>     lrt_sim <- foreach (i = 1:n_sim,
>                         .combine   = "c",
>                         .packages = "lme4") %dopar% {
> 
>         ysim       <- unlist(simulate(object = modnull))
> 
>         modnullsim <- refit(object  = modnull,
>                             newresp = ysim) # Fit the null-model.
> 
>         modaltsim  <- refit(object  = mod,
>                             newresp = ysim) # Fit the alternative model.
> 
>         anova(modnullsim,
>               modaltsim)$Chisq[2] # Save the likelihood ratio test statistic.
> 
>     }
> })[3]
> 
> Take home message Method 4: p.value = 0.00429957
> 
> 
> 4.3 Notable:
> ------------
> The models were fit with the option na.action = na.exclude.
> 
> 
> Method 5.
> =========
> 
> 5.1 lmer() model calls:
> -----------------------
> 
> # differences explained in 5.3
> mod_min_naexclude <- lmer(formula = log(value)
>                           ~ 1
>                           + (1 + satz_typ | vp)
>                           + (1 + satz_typ | kontext),
>                           data      = wort12_lmm2_melt_nona,
>                           REML      = FALSE)
> 
> mod8_naexclude <- lmer(formula = log(value)
>                        ~ passung
>                        + (1 + satz_typ | vp)
>                        + (1 + satz_typ | kontext),
>                        data      = wort12_lmm2_melt_nona,
>                        REML      = FALSE)
> 
> 
> 5.2 Parametric Bootstrap:
> -------------------------
> 
> identical to Method 4.
> 
> Take home message Method 5: p.value = 0.0039996
> 
> 5.3 Notable:
> ------------
> The models were fit on a version of the original data set that was reduced to
> complete cases only, i.e. on a data set altered by the following code:
> 
> wort12_lmm2_melt_nona <-
>     wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]
> 
> 
> Question:
> =========
> 
> Why do Method 2 and Method 3 yield p-values that deviate by about 0.002 (not
> only marginally, in my opinion) from all the other methods?
> 
> (I suspect the explanation is really simple and that I just do not see it: I
> have been looking at this code for too long and am missing the obvious.)
> 
> Two final remarks:
> 
> (1) I ran everything on a cluster. That is why I was able to repeat Methods 2
> and 3. The results they yield seem to be numerically stable.
> 
> (2) I plotted the likelihood ratio test statistics for all methods and compared
> them to a chi-squared distribution with df = 1 as well as to their likelihood
> ratio test density. Everything looks normal. The chi-squared assumption is
> appropriate for these models.
> 
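
  (A quick sketch of that kind of check, assuming 'lrt_sim' holds the
simulated statistics from one of the methods, might look something like
this:)

hist(lrt_sim, breaks = 50, freq = FALSE,
     main = "bootstrap LRT statistics vs. chi-squared(df = 1)",
     xlab = "LRT statistic")
curve(dchisq(x, df = 1), from = 0.01, to = max(lrt_sim),
      add = TRUE, lwd = 2)
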
> Thank you so much for your patience!
> 
> Best,
> Christian Brauner
> 
