[R-sig-ME] Most principled reporting of mixed-effect model regression coefficients

Ades, James  jades at health.ucsd.edu
Fri Feb 28 08:18:51 CET 2020


Thanks, Daniel and Paul.

Yes, I did read that conditional R^2 is higher. From the N&S article, it seems that it represents variance explained by fixed and random factors. Still, depending on the outcome measure, it seems that there would exist a good deal of variance still unaccounted for even considering random factors.

Thanks for the synopses of different packages. I'll try them out and see whether they yield similar answers. It's also helpful to know the ways in which they differ for current and future use.

Regarding your last comment, I think AIC does a good job of selection and parameter penalization (which is important for my focus); however, when it comes to comparing and communicating the differences between models built on different performance measures, AIC can really only say whether one model is better than another, not how much better. This matters especially when you have to weigh the pros and cons of different performance measures and metrics. For instance, some recent research demonstrates a lack of test-retest reliability in executive function tasks. If an aggregate measure is more reliable but doesn't explain as much variance, it's tough to say which metric is better, which is where it would be nice to quantify the difference in predictive capability between the models.

From what I've read, different methods of CV introduce different biases. Do you think just splitting the data 80/20 (sampling randomly by participant, keeping each participant's timepoints together) would be a sufficient/principled way forward, and then using RMSE or another summary measure to compare models? If I'm using the predict function, it will output a prediction for every timepoint. Would I take only the 4th-timepoint predictions to compare with the observed values and then compute RMSE? I.e.:

```
# Predict for the held-out data using fixed effects only (re.form = NA)
predict.fun <- function(train.model, test.dataset) {
  predict(train.model, newdata = test.dataset, re.form = NA)
}
# Store predictions in a new column rather than overwriting the observed outcome
test.dataset$predicted <- predict.fun(train.model, test.dataset)
```
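If it's useful, here is a minimal sketch of the whole 80/20-by-participant split and the 4th-timepoint RMSE I have in mind (the data frame `dat` and the columns `id`, `timepoint`, and `outcome` are placeholder names, and the formula is only an example):

```
# Sketch: 80/20 split by participant, then RMSE on the 4th timepoint only.
# 'dat', 'id', 'timepoint', and 'outcome' are hypothetical names.
library(lme4)

set.seed(1)
ids           <- unique(dat$id)
test.ids      <- sample(ids, size = round(0.2 * length(ids)))
train.dataset <- dat[!dat$id %in% test.ids, ]
test.dataset  <- dat[dat$id %in% test.ids, ]

train.model <- lmer(outcome ~ timepoint + (timepoint | id), data = train.dataset)

t4   <- test.dataset[test.dataset$timepoint == 4, ]        # held-out 4th timepoint
pred <- predict(train.model, newdata = t4, re.form = NA)   # fixed effects only
rmse <- sqrt(mean((t4$outcome - pred)^2))
```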
Just to be clear about R^2 in multilevel models: am I understanding correctly that it shouldn't be used to look at individual predictors and the variance they explain, but rather only as a measure of the variance explained by the entire model?

This is all super helpful. Thanks!

James


________________________________
From: Daniel Lüdecke <d.luedecke at uke.de>
Sent: Wednesday, February 26, 2020 7:49 AM
To: Ades, James <jades at health.ucsd.edu>
Cc: r-sig-mixed-models at r-project.org <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME] Most principled reporting of mixed-effect model regression coefficients

Hi James,

> Using the N&S equation, for one of my models, I get an R^2 of .35, while using r.squaredGLMM, I get an R^2 of .43. I can't imagine that the random slope of time would make that big of a difference. (The conditional R^2 is .95, and I have no idea how it's that high).

The conditional R2 is usually higher, in case of random slopes probably much higher. You can try to compare your results with performance::r2_nakagawa(), piecewiseSEM::rsquared() and MuMIn::r.squaredGLMM() (see also next answer below).
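For a quick check, a minimal sketch of that comparison (using the sleepstudy data from lme4 as a stand-in for a model with a random slope):

```
# Compare the three R2 implementations on the same random-slope model;
# sleepstudy ships with lme4 and is only a placeholder for your own data.
library(lme4)

m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

performance::r2_nakagawa(m)   # marginal and conditional R2
piecewiseSEM::rsquared(m)     # marginal and conditional R2
MuMIn::r.squaredGLMM(m)       # marginal (R2m) and conditional (R2c) R2
```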

> Does anyone have any experience with the package?

In most cases, these packages yield similar to identical results. There are, however, certain situations where one or the other package might fit your needs better. In my experience, the functions differ on the following points:

- r2_nakagawa() only uses log-approximation, while rsquared() and r.squaredGLMM() also yield the results from the delta-approximation.
- all three functions yield consistent results for linear mixed models, including random slopes
- rsquared() fails for binomial models where the response is cbind(trials, success) or similar
- r2_nakagawa() and r.squaredGLMM() can handle models from package glmmTMB
- all three functions yield different results for "glmer(count ~ spp + mined + (1 | site), family = poisson, data = Salamanders)"
- currently, only r2_nakagawa() supports glmmTMB with zero-inflation
- r2_nakagawa() probably supports more packages (like GLMMadaptive or afex)


> Most of the time you can find a better statistic than R^2

I personally would recommend looking at different aspects, not only "fit measures", but also which model makes more sense theoretically. Furthermore, it depends on what you're focussing on. If you're interested in the variance of your random effects, i.e. how much variation there is in your outcome across different groups/clusters/subjects, comparing the conditional and marginal R2 makes more sense than looking at AIC (you could also look directly at the ICC in such a situation, which goes in a similar direction to the marginal and conditional R2). Also, AIC is most useful when comparing models but tells you less about a single model on its own, whereas R2 is indeed a useful measure for a single model.
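As a minimal sketch of what I mean (assuming the performance package and a fitted lme4 model called m):

```
# Variance-decomposition summaries for a single fitted model 'm' (placeholder name)
library(performance)

icc(m)          # intraclass correlation
r2_nakagawa(m)  # marginal vs. conditional R2
AIC(m)          # mainly informative relative to an alternative model
```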

Best
Daniel

-----Original Message-----
From: R-sig-mixed-models [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Ades, James
Sent: Wednesday, 26 February 2020 06:19
To: Maarten Jung <Maarten.Jung at mailbox.tu-dresden.de>
Cc: r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] Most principled reporting of mixed-effect model regression coefficients

Thanks, Daniel and Maarten!

I looked at both the Nakagawa and Schielzeth paper and the Johnson paper; I also looked through your other references...thanks for those. I really liked WHuber's lucid response about R^2 in the linked Stack Exchange post.

Johnson references the MuMIn package, which I wasn't familiar with, though he writes that the function "r.squaredGLMM" takes into account the random slope (something that N & S mention as tedious and then wave aside). Using the N&S equation, for one of my models, I get an R^2 of .35, while using r.squaredGLMM, I get an R^2 of .43. I can't imagine that the random slope of time would make that big of a difference. (The conditional R^2 is .95, and I have no idea how it's that high). Does anyone have any experience with the package?

Some models (not for model selection, but looking at PCA, individual variables, or some kind of aggregate measure of executive function) have comparatively large differences in AIC, yet using R^2 via MuMIn they might differ by only .01. In other words, what seemed to be decent (and, by LRT, significant) differences became inconsequential with r.squaredGLMM.

AIC seems to do a commendable job of yielding parsimony, but its utter lack of comparability (with the same # of observations) is frustrating. While an AIC of 28,620 is better than one of 28,645, there is, to my knowledge, no real way of quantifying that difference. Alas, while WHuber writes, "Most of the time you can find a better statistic than R^2. For model selection you can look to AIC and BIC," I think the issue is not only in selecting models (which AIC seems to do quite well), but again, in summarizing those models in intuitively quantitative ways.

I've also looked into doing some kind of multiple-time-series cross-validation, though from what I've read (see the links below), this is similarly fraught. Maybe leave-one-out is the best way to go. The structure of the data is four timepoints of executive function data: the first two timepoints ('17 school year) and the final two timepoints ('18 school year) each correspond to that year's standardized test.
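In case it clarifies what I mean by leave-one-out here, a rough sketch at the participant level (all names below are placeholders for my actual data and model):

```
# Sketch: leave-one-participant-out CV; 'dat', 'id', 'timepoint', and 'outcome'
# are hypothetical names, and the formula is only an example.
library(lme4)

loso.rmse <- sapply(unique(dat$id), function(p) {
  train <- dat[dat$id != p, ]
  test  <- dat[dat$id == p, ]
  fit   <- lmer(outcome ~ timepoint + (timepoint | id), data = train)
  pred  <- predict(fit, newdata = test, re.form = NA)   # fixed effects only
  sqrt(mean((test$outcome - pred)^2))
})
mean(loso.rmse)                                          # average held-out RMSE
```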

Thanks much!



http://www.stat.columbia.edu/~gelman/research/published/final_sub.pdf
Wang and Gelman (2014), "Difficulty of selecting among multilevel models using predictive accuracy", Statistics and Its Interface 7.

https://dl.acm.org/doi/10.1016/j.ins.2011.12.028
"On the use of cross-validation for time series predictor evaluation", Information Sciences.

https://onlinelibrary.wiley.com/doi/full/10.1111/ecog.02881
Roberts et al. (2017), "Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure", Ecography 40.



________________________________
From: Maarten Jung <Maarten.Jung at mailbox.tu-dresden.de>
Sent: Monday, February 17, 2020 1:35 AM
To: Ades, James <jades at health.ucsd.edu>
Cc: r-sig-mixed-models at r-project.org <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME] Most principled reporting of mixed-effect model regression coefficients

> Thanks, Maarten. So I was planning on reporting R^2 (along with AIC) for the overall model fit, not for each predictor, since the regression coefficients themselves give a good indication of the relationships (though I wasn't aware that R^2 is "riddled with complications"). Is Henrik saying this only with regard to LMMs and GLMMs?

That makes sense to me. For the overall model fit I would probably still go with Johnson's version [1] which I describe in my StackExchange post (and I think you mentioned it, or the Nakagawa and Schielzeth version it is based on, earlier) and report both the marginal and conditional R^2 values. The regression coefficients provide unstandardized effect sizes on the response scale which I think are a valid way to report effect sizes (see below).
I think Henrik refers to (G)LMMs and gives Rights & Sterba (2019) [2] as reference. Also, the GLMM FAQ website provides a good overview [3].

> When you say "there is no agreed upon way to calculate effect sizes" I'm a little confused. I read through your stack exchange posting, but Henrik's answer refers to standardized effect size. You write, later down, "Whenever possible, we report unstandardized effect sizes which is in line with general recommendation of how to report effect sizes"

What you cite is still Henrik's opinion (and I hoped that I could make this clear by writing "This is what he suggests [...]" and by using the <blockquote> on StackExchange). And your citation still refers to LMMs as he says "Unfortunately, due to the way that variance is partitioned in linear mixed models (e.g., Rights & Sterba, 2019), there does not exist an agreed upon way to calculate standard effect sizes for individual model terms such as main effects or interactions."
In general, I agree with him and with his recommendation to report unstandardized effect sizes (e.g. regression coefficients) if they have a "meaningful" interpretation.
The semi-partial R^2 I mentioned in my last e-mail is an additional/alternative indicator of effect size that is probably more in line with what psychologists are used to seeing reported in papers (especially when results of factorial designs are reported) - and that's the reason I mentioned it.

> I'm also working on a systematic review where there's disagreement over whether effect sizes should be standardized, but it does seem that, to yield any kind of meaningful comparison, effect sizes would have to be standardized. I don't usually report standardized effect sizes...however, there are times when I z-score IVs to put them on the same scale, and I guess the output of that would be a standardized effect size. I wasn't aware of pushback on that practice. What issues would arise from this?

There is nothing wrong with standardizing predictor variables (e.g. by dividing by 1 or 2 standard deviations) to get measures of variable importance (within the same model).
Issues arise when standardized effect sizes such as R^2, partial eta^2, etc. between different models are compared without thinking about what differences in these measures can be attributed to (see e.g. this question [4] or the Pek & Flora (2018) paper [5] that Henrik cites). Note that these are general issues that apply to all regression models, not only mixed models.
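For illustration, a minimal sketch of the 1-SD and 2-SD scaling mentioned above ('dat' and 'x' are placeholder names):

```
# Standardize a numeric predictor; dividing by 2 SDs puts it roughly on the
# scale of a binary predictor. 'dat' and 'x' are hypothetical names.
dat$x_1sd <- (dat$x - mean(dat$x, na.rm = TRUE)) / sd(dat$x, na.rm = TRUE)
dat$x_2sd <- (dat$x - mean(dat$x, na.rm = TRUE)) / (2 * sd(dat$x, na.rm = TRUE))
```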

[1] https://doi.org/10.1111/2041-210X.12225
[2] https://doi.org/10.1037/met0000184
[3] https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#how-do-i-compute-a-coefficient-of-determination-r2-or-an-analogue-for-glmms
[4] https://stats.stackexchange.com/questions/13314/is-r2-useful-or-dangerous/13317
[5] https://doi.org/10.1037/met0000126

Best,
Maarten

