[R-sig-ME] Why do extremely similar parametric bootstraps on lmer() objects yield (not only marginally) different results?

Christian Brauner christianvanbrauner at gmail.com
Sat May 3 21:38:30 CEST 2014



>  * is the difference between p=0.004 and p=0.006 _practically_
> important?
No, I don’t think it is important at all, mostly because I don’t rely on p
values. (But a lot of people still like to see them.) In this particular case
it is not important because the p value lies well below the specified alpha
level, here the customary 0.05. Still, the results differ by at least 0.002,
and this difference seems stable across different ways of treating NAs (as
I’ve tried to show). To me this suggests that one of the methods has a
systematic bias, but I don’t know which one, and hence whether the bias is
anticonservative or conservative. It was pure curiosity that made me want to
find out which of the two it is. Any hints on how to do so are welcome. (I saw
you suggested looking at a single refit with the seed fixed.)
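
Here is a minimal sketch of that suggestion, for my own reference. It assumes
the fitted objects mod8 and mod_min from below, fit on the complete-case data
frame so that all lengths match; with the seed fixed, the by-hand path (as in
Method 2) and the refit() path (as in Method 5) should produce the same
likelihood ratio statistic for the same simulated response, and a discrepancy
would point directly at the NA handling:

library(lme4)

set.seed(42)                          # same seed, so both paths see the same response
ysim <- unlist(simulate(mod_min))     # one simulated response from the null model

# path 1: by-hand lmer() refits on the stored model frame (as in Method 2)
dattemp <- model.frame(mod8)
m0_hand <- lmer(ysim ~ 1 + (1 + satz_typ | vp) + (1 + satz_typ | kontext),
                data = dattemp, REML = FALSE)
m1_hand <- lmer(ysim ~ passung + (1 + satz_typ | vp) + (1 + satz_typ | kontext),
                data = dattemp, REML = FALSE)

# path 2: refit() on the original fits (as in Method 5)
m0_refit <- refit(mod_min, newresp = ysim)
m1_refit <- refit(mod8,    newresp = ysim)

# the two single likelihood ratio statistics should agree
anova(m0_hand,  m1_hand)$Chisq[2]
anova(m0_refit, m1_refit)$Chisq[2]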

>  * for this example, it seems that (as you comment) almost any approach
> will give reasonable answers -- simple LRT (because the number of levels
> of the minimum-size grouping variable is >=40), Kenward-Roger (because
> this is a LMM).  Parametric bootstrap feels like a waste of computation
> time.
Agreed. But the computation time didn’t matter much to me, since all these
things run on a grid at 5 in the morning and finish within 30 minutes. Usually
nobody is doing anything on it at that time.
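
For completeness, the two cheaper alternatives would look roughly like this
(just a sketch, assuming mod8 and mod_min as fit below; KRmodcomp() is
pbkrtest’s Kenward-Roger comparison, and since Kenward-Roger is defined for
REML fits I refit both models with REML = TRUE first):

library(pbkrtest)

# simple likelihood ratio test on the ML fits (REML = FALSE)
anova(mod_min, mod8)

# Kenward-Roger approximation on REML refits of the same models
kr <- KRmodcomp(largeModel = update(mod8,    REML = TRUE),
                smallModel = update(mod_min, REML = TRUE))
kr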

> * I can't do it off the top of my head, but for a balanced/fully crossed
> random effect, one might even be able to compute the denominator df
> for the F distribution exactly.
I actually did not know that. If you have any papers, code, or explanations
that you could share with me, I would greatly appreciate it!

> * is there a reason you're not using refit() in method 2?  It should
> be faster.
I am usually only comfortable calling functions that I have had the time to
understand (reasonably) properly, e.g. by looking at their code, and whose
inner workings I know; that way I can stand behind the numbers I report. I am
not that acquainted with refit(), so up until now I have been doing things by
hand. Besides, I like to do things by hand first, just to check that I know
what I am doing.
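
For what it is worth, to get comfortable with refit() I would probably start
by reading its source; as far as I can see, lme4 registers refit() as an S3
generic with a merMod method, so the code can be pulled up directly:

library(lme4)

# read the source of the merMod method behind refit()
getS3method("refit", "merMod")

# ?refit documents how newresp is matched against the original fit
?refit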

> * does PBmodcomp() work with the latest version of lme4, with the
> simulate/NA fix?  Or do we or the pbkrtest maintainers still have work
> to do?
I just did a test with pbkrtest_0.3-8 and the GitHub version of lme4_1.1-7. If
I specify my models:

mod8 <- lmer(formula = log(value)
             ~ passung
             + (1 + satz_typ | vp)
             + (1 + satz_typ | kontext),
             data = wort12_lmm2_melt,
             REML = FALSE)

mod_min <- lmer(formula = log(value)
                ~ 1
                + (1 + satz_typ | vp)
                + (1 + satz_typ | kontext),
                data = wort12_lmm2_melt,
                REML = FALSE)

and call:

pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
                                smallModel = mod_min_nona,
                                nsim       = 100,
                                cl         = cl)

without removing the na.action attribute, I get:

> pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
+                                 smallModel = mod_min_nona,
+                                 nsim       = 100,
+                                 cl         = cl)
Error in checkForRemoteErrors(lapply(cl, recvResult)) : 
  one node produced an error: replacement has 10263 rows, data has 10713
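
The workaround that ended up working for me, as in Methods 3 and 5 below, is
to fit both models on a complete-case data frame before handing them to
PBmodcomp(). A rough sketch (mod8_cc and mod_min_cc are just names I use here
for the complete-case refits):

library(pbkrtest)

# restrict the data to complete cases so that simulate()/refit() inside
# PBmodcomp() never see NA rows
wort12_lmm2_melt_nona <- wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]

mod8_cc    <- update(mod8,    data = wort12_lmm2_melt_nona)
mod_min_cc <- update(mod_min, data = wort12_lmm2_melt_nona)

pbmod_cc <- PBmodcomp(largeModel = mod8_cc,
                      smallModel = mod_min_cc,
                      nsim       = 100,
                      cl         = cl)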

> * I'm a little bit worried by method 2 -- I can imagine (although haven't
> tested) that ysim may be a different length/may not match mod@frame,
> which will be NA-processed. However, I think it should work OK
> if the default na.action is 'na.omit' (I'd prefer that people _never_ used @
> in their code ...)
That actually was a problem. If I fit the models with lmer(..., na.action =
na.exclude) but leave the rest the way it is, I get a model.frame.default
error message that has to do with differing lengths. That’s why I redid
Method 2 as Method 3, where I constructed a data frame with complete cases.

Thank you so much for taking a look at this!
Christian

Eberhard Karls Universität Tübingen
Mathematisch-Naturwissenschaftliche Fakultät - Faculty of Science
Evolutionary Cognition - Cognitive Science
Schleichstraße 4 · 72076 Tübingen · Germany
Phone +49 7071 29-75643 · Fax +49 7071 29-4721
christian.brauner at uni-tuebingen.de


On Sat, May 03, 2014 at 02:14:30PM -0400, Ben Bolker wrote:
> On 14-05-03 07:59 AM, Christian Brauner wrote:
> > Hello all,
> > 
> 
> > warning ahead: This will be quite a long post. But have faith, it is
> >  necessary.
> 
>   I have some questions about the larger context here.  I agree that
> it's important to understand the behaviour of the code, and to ferret
> out possible bugs in the code (I will say that handling all of the
> permutations of NA handling in the predict and simulate code is a big
> headache ...)  However, I'm wondering about a few things:
> 
>  * is the difference between p=0.004 and p=0.006 _practically_
> important?
>  * for this example, it seems that (as you comment) almost any approach
> will give reasonable answers -- simple LRT (because the number of levels
> of the minimum-size grouping variable is >=40), Kenward-Roger (because
> this is a LMM).  Parametric bootstrap feels like a waste of computation
> time.
> * I can't do it off the top of my head, but for a balanced/fully crossed
> random effect, one might even be able to compute the denominator df
> for the F distribution exactly.
> * is there a reason you're not using refit() in method 2?  It should
> be faster.
> * does PBmodcomp() work with the latest version of lme4, with the
> simulate/NA fix?  Or do we or the pbkrtest maintainers still have work
> to do?
> * I'm a little bit worried by method 2 -- I can imagine (although haven't
> tested) that ysim may be a different length/may not match mod@frame,
> which will be NA-processed. However, I think it should work OK
> if the default na.action is 'na.omit' (I'd prefer that people _never_ used @
> in their code ...)
> 
> 
> > I run parametric bootstraps in order to compare two models. I
> > bootstrap the likelihood ratio between the models. For anybody
> > acquainted with this, the code should be self-explanatory. Model
> > selection is more or less done, and these are exercises I like to
> > perform with real data sets to get a feel for how different software
> > packages treat the same models. Unfortunately, I am not in a
> > position to provide the real data. A few words about the models
> > though:
> 
> > Both models are fit on a balanced data set that includes around 483 missing
> > values. There are slightly more than 11196 observations (within: 70 subjects).
> > Every subject saw every item only once. The dependent variable is reading
> > time; sentence_type and matching are crossed two-level factors. Context and
> > subject are treated as random effects. Context has 40 levels and (as you
> > guessed) subject has 70 levels. (See this related post:
> > http://stats.stackexchange.com/questions/94302/lmer-parametric-bootstrap-testing-for-fixed-effects):
> > 
> > Model formulae:
> > ===============
> >         
> > # I only show the model formulae here because this piece stays fixed with
> > # every call to lmer(). The options, however, change depending on the parametric
> > # bootstrap procedure used. Any change can be seen in the code:
> > mod_min := value ~ 1       + (1 + satz_typ | vp) + (1 + satz_typ | kontext)
> > mod8    := value ~ passung + (1 + satz_typ | vp) + (1 + satz_typ | kontext)
> > 
> > I set up different ways to perform the parametric bootstrap. Below I describe
> > each of them, explain what I changed every time, and show the results. The
> > ultimate question will be: why the different results, especially with one of
> > the variants?
> 
>   Making my own summary:
> 
> 1. pbkrtest, na.action attribute removed: p = 0.003
> 2. lmer refitting: p = 0.0064
> 3. fitted on complete cases: p = 0.0066
> 4. using na.exclude and refit(): p = 0.0043
> 5. using complete cases and refit(): p = 0.004
> 
> So I guess, to boil it down, the difference is in the way that refit
> handles missing values / accesses the model frame.  If I were going
> to pursue this further, I would probably try to strip it down and look
> at the results of a _single_ refit with the random number seed set to
> be identical in every case ...
> 
> 
> > 
> > Method 1 (pbkrtest).
> > ====================
> > 
> > 1.1. lmer() model calls:
> > ------------------------
> > 
> > mod_min <- lmer(formula = log(value)
> >                 ~ 1
> >                 + (1 + satz_typ | vp)
> >                 + (1 + satz_typ | kontext),
> >                 data = wort12_lmm2_melt,
> >                 REML = FALSE)
> > 
> > mod8 <- lmer(formula = log(value)
> >              ~ passung
> >              + (1 + satz_typ | vp)
> >              + (1 + satz_typ | kontext),
> >              data = wort12_lmm2_melt,
> >              REML = FALSE)
> > 
> > 1.2. Parametric Bootstrap:
> > --------------------------
> > 
> > # remove NAs otherwise PBmodcomp() will complain as it calls the simulate()
> > # function directly without the argument na.action = na.exclude
> > # Relevant: See Ben Bolkers comments on this:
> > # https://github.com/lme4/lme4/issues/189 dirty fix:
> > mod8_nona    <- mod8
> > mod_min_nona <- mod_min
> > 
> > attr(x     = mod8_nona@frame,
> >      which = "na.action") <- NULL # removing na.action attribute from the model
> > 
> > attr(x     = mod_min_nona@frame,
> >      which = "na.action") <- NULL # removing na.action attribute from the model
> > 
> > library(pbkrtest)
> > 
> > pbmod_mod8_mod_min <- PBmodcomp(largeModel = mod8_nona,
> >                                 smallModel = mod_min_nona,
> >                                 nsim       = 10000,
> >                                 cl         = cl)
> > 
> > 
> > Take home message PBmodcomp(): p.value = 0.003
> > 
> > 1.3 Notable:
> > ------------
> > na.action attribute has been removed from both models before simulate() call.
> > 
> > 
> > Method 2.
> > =========
> > 
> > 2.1 lmer() model calls:
> > -----------------------
> > 
> > identical to Method 1 (pbkrtest).
> > 
> > 2.2 Parametric Bootstrap:
> > -------------------------
> > 
> > mod <- mod8
> > modnull <- mod_min
> > # save the observed likelihood ratio test statistic
> > lrt_obs <- anova(modnull,
> >                  mod)$Chisq[2]
> > 
> > n_sim <- 10000
> > 
> > dattemp <- mod@frame
> > 
> > ptime <- system.time({
> >     lrt_sim <- foreach (i = 1:n_sim,
> >                         .combine  = "c",
> >                         .packages = "lme4") %dopar% {
> > 
> >         ysim       <- unlist(x = simulate(object = modnull))
> > 
> >         modnullsim <- lmer(formula = ysim
> >                            ~ 1
> >                            + (1 + satz_typ | vp)
> >                            + (1 + satz_typ | kontext),
> >                            data = dattemp,
> >                            REML = FALSE) # Fit the null model.
> > 
> >         modaltsim  <- lmer(formula = ysim
> >                            ~ passung
> >                            + (1 + satz_typ | vp)
> >                            + (1 + satz_typ | kontext),
> >                            data = dattemp,
> >                            REML = FALSE) # Fit the alternative model.
> > 
> >         anova(modnullsim,
> >               modaltsim)$Chisq[2] # Save the likelihood ratio test statistic.
> > 
> >     }
> > })[3]
> > 
> > (sum(lrt_sim >= lrt_obs) + 1)/(n_sim + 1) = 0.00639936 # p.value
> > 
> > Take home message Method 2: p.value = 0.00639936
> > 
> > 2.3 Notable:
> > ------------
> > na.action has not been altered!
> > 
> > 
> > Method 3.
> > =========
> > 3.1 lmer() model calls:
> > -----------------------
> > 
> > # differences explained in 3.3
> > mod_min <- lmer(formula = log(value)
> >                 ~ 1
> >                 + (1 + satz_typ | vp)
> >                 + (1 + satz_typ | kontext),
> >                 data = wort12_lmm2_melt_nona,
> >                 REML = FALSE)
> > 
> > mod8 <- lmer(formula = log(value)
> >              ~ passung
> >              + (1 + satz_typ | vp)
> >              + (1 + satz_typ | kontext),
> >              data = wort12_lmm2_melt_nona,
> >              REML = FALSE)
> > 
> > 3.2 Parametric Bootstrap:
> > -------------------------
> > 
> > identical to Method 2
> > 
> > Take home message Method 3: p.value = 0.00659934
> > 
> > 
> > 3.3 Notable:
> > ------------
> > 
> > The models were fit on the original dataset which only included complete cases.
> > That is, on a data set that was altered by the following code:
> > 
> > wort12_lmm2_melt_nona <-
> >     wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]
> > 
> > 
> > Method 4.
> > =========
> > 4.1 lmer() model calls:
> > -----------------------
> > 
> > # differences explained in 4.3
> > mod_min_naexclude <- lmer(formula = log(value)
> >                           ~ 1
> >                           + (1 + satz_typ | vp)
> >                           + (1 + satz_typ | kontext),
> >                           data      = wort12_lmm2_melt,
> >                           REML      = FALSE,
> >                           na.action = na.exclude)
> > 
> > mod8_naexclude <- lmer(formula = log(value)
> >                        ~ passung
> >                        + (1 + satz_typ | vp)
> >                        + (1 + satz_typ | kontext),
> >                        data      = wort12_lmm2_melt,
> >                        REML      = FALSE,
> >                        na.action = na.exclude)
> > 
> > 
> > 4.2 Parametric Bootstrap:
> > -------------------------
> > mod <- mod8_naexclude
> > modnull <- mod_min_naexclude
> > 
> > lrt_obs <- anova(modnull,
> >                  mod)$Chisq[2] # Save the observed likelihood ratio test statistic.
> > 
> > n_sim <- 10000
> > 
> > ptime <- system.time({
> >     lrt_sim <- foreach (i = 1:n_sim,
> >                         .combine   = "c",
> >                         .packages = "lme4") %dopar% {
> > 
> >         ysim       <- unlist(simulate(object = modnull))
> > 
> >         modnullsim <- refit(object  = modnull,
> >                             newresp = ysim) # Fit the null-model.
> > 
> >         modaltsim  <- refit(object  = mod,
> >                             newresp = ysim) # Fit the alternative model.
> > 
> >         anova(modnullsim,
> >               modaltsim)$Chisq[2] # Save the likelihood ratio test statistic.
> > 
> >     }
> > })[3]
> > 
> > Take home message Method 4: p.value = 0.00429957
> > 
> > 
> > 4.3 Notable:
> > ------------
> > The models were fit with the option na.action = na.exclude.
> > 
> > 
> > Method 5.
> > =========
> > 
> > 5.1 lmer() model calls:
> > -----------------------
> > 
> > # differences explained in 5.3
> > mod_min_naexclude <- lmer(formula = log(value)
> >                           ~ 1
> >                           + (1 + satz_typ | vp)
> >                           + (1 + satz_typ | kontext),
> >                           data      = wort12_lmm2_melt_nona,
> >                           REML      = FALSE)
> > 
> > mod8_naexclude <- lmer(formula = log(value)
> >                        ~ passung
> >                        + (1 + satz_typ | vp)
> >                        + (1 + satz_typ | kontext),
> >                        data      = wort12_lmm2_melt_nona,
> >                        REML      = FALSE)
> > 
> > 
> > 5.2 Parametric Bootstrap:
> > -------------------------
> > 
> > identical to Method 4
> > 
> > Take home message Method 5: p.value = 0.0039996
> > 
> > 5.3 Notable:
> > ------------
> > The models were fit on the original dataset which only included complete cases.
> > That is, on a data set that was altered by the following code:
> > 
> > wort12_lmm2_melt_nona <-
> >     wort12_lmm2_melt[complete.cases(wort12_lmm2_melt), ]
> > 
> > 
> > Question:
> > =========
> > 
> > Why do Methods 2 and 3 yield p values that deviate by about 0.002 from all
> > the other methods?
> > 
> > (I suspect the explanation for why I do not see it is really simple: I have
> > just been looking at this code for too long and am missing the obvious.)
> > 
> > Two final remarks:
> > 
> > (1) I ran everything on a cluster. That’s why I was able to repeat Methods 2
> > and 3; the results they yield seem to be numerically stable.
> > 
> > (2) I plotted the likelihood ratio test statistics for all methods, together
> > with their densities, and compared them to a chi-squared distribution with
> > df = 1. Everything looks normal; the chi-square assumption holds for these
> > models.
> > 
> > Thank you so much for your patience!
> > 
> > Best,
> > Christian Brauner
> > 
> > _______________________________________________
> > R-sig-mixed-models at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> > 
>


