[R-sig-ME] Most principled reporting of mixed-effect model regression coefficients
jades sending from health.ucsd.edu
Wed Feb 26 06:19:09 CET 2020
Thanks, Daniel and Maarten!
I looked at both Nakagawa and Schielzeth and the Johnson paper; I also looked through your other references, thanks for those. I really liked the linked Stack Exchange post with WHuber's lucid answer on R^2.
Johnson references the MuMIn package, which I wasn't familiar with; he writes that its function "r.squaredGLMM" takes the random slope into account (something that N & S mention as tedious and then wave aside). Using the N & S equation for one of my models, I get an R^2 of .35, while using r.squaredGLMM, I get an R^2 of .43. I can't imagine that the random slope of time would make that big of a difference. (The conditional R^2 is .95, and I have no idea how it's that high.) Does anyone have any experience with the package?
Some models (compared not for model selection, but looking at PCA, individual variables, or some kind of aggregate measure of executive function) have comparatively large differences in AIC, yet their R^2 values from MuMIn might differ by only .01. In other words, differences that seemed decent (and significant with LRT) became inconsequential with r.squaredGLMM.
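For readers following along, here is a minimal sketch of the arithmetic behind the marginal and conditional R^2 as I understand the Nakagawa & Schielzeth definition, with a Johnson-style mean random-effect variance for a random intercept plus a random slope of time. It is written in Python with the standard library only; the function names and all variance components are made up for illustration, and this is not a claim about how MuMIn implements r.squaredGLMM internally.

```python
def nakagawa_r2(var_fixed, var_random, var_resid):
    """Marginal and conditional R^2 from variance components (Gaussian model)."""
    total = var_fixed + var_random + var_resid
    marginal = var_fixed / total                     # fixed effects only
    conditional = (var_fixed + var_random) / total   # fixed + random effects
    return marginal, conditional


def mean_random_effect_variance(times, var_int, var_slope, cov_int_slope):
    """Average of z' Sigma z over observations, with z = (1, t).

    This is the random-effect variance term once a random slope of time is
    taken into account (assumed form, following Johnson's extension).
    """
    per_obs = [var_int + 2.0 * t * cov_int_slope + t * t * var_slope for t in times]
    return sum(per_obs) / len(per_obs)


# Hypothetical values, NOT taken from the models discussed in this thread.
times = [0, 1, 2, 3]          # four timepoints, coded 0..3 (assumption)
var_fixed = 3.0               # variance of the fixed-effect predictions
var_int, var_slope, cov_is = 4.0, 1.0, 0.0
var_resid = 0.5

# Version that ignores the slope variance (intercept variance only).
r2_intercept_only = nakagawa_r2(var_fixed, var_int, var_resid)

# Version that averages the random-effect variance over the time values.
var_random = mean_random_effect_variance(times, var_int, var_slope, cov_is)
r2_with_slope = nakagawa_r2(var_fixed, var_random, var_resid)

print("ignoring slope:", r2_intercept_only)
print("with slope:    ", r2_with_slope)
```

Two things are visible from the formulas alone: the two versions can differ noticeably when the slope variance is not small relative to the other components, and the conditional R^2 is high whenever the residual variance is small compared with the fixed plus random variance. How the time variable is coded also changes the averaged random-effect variance.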
AIC seems to do a commendable job of yielding parsimony, but its utter lack of comparability (with the same # of observations) is frustrating. While an AIC of 28,620 is better than one of 28,645, there is, to my knowledge, no real way of quantifying that difference. Alas, while WHuber writes, "Most of the time you can find a better statistic than R^2. For model selection you can look to AIC and BIC," I think the
issue is not only in selecting models (which AIC seems to do quite well), but again, in
summarizing those models in intuitively quantitative ways.
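One convention that is sometimes used to put a number on an AIC difference is the relative likelihood exp(-delta/2) and the Akaike weights derived from it. The sketch below applies that convention to the two AIC values mentioned above; the function name is hypothetical, and the result only makes sense for models fitted to exactly the same data. It is a relative measure between candidate models, not an absolute measure of fit like R^2.

```python
import math


def akaike_summary(aics):
    """Return AIC differences, relative likelihoods and Akaike weights."""
    best = min(aics)
    deltas = [a - best for a in aics]              # difference from the best model
    rel_lik = [math.exp(-d / 2.0) for d in deltas]  # relative likelihood of each model
    total = sum(rel_lik)
    weights = [r / total for r in rel_lik]          # normalised to sum to 1
    return deltas, rel_lik, weights


# The two AIC values quoted in the message above.
deltas, rel_lik, weights = akaike_summary([28620.0, 28645.0])

for d, r, w in zip(deltas, rel_lik, weights):
    print(f"delta={d:.1f}  relative likelihood={r:.3g}  weight={w:.6f}")
```

A difference of 25 AIC units gives the second model a very small relative likelihood, which is one way of saying that the first model is strongly preferred within this pair, even though the two may explain nearly the same share of variance.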
I've also looked into doing some kind of multiple time series cross-validation,
though from what I've read (see below), this is similarly fraught. Maybe leave-one-out is
the best way to go. The data have four timepoints of executive function
data. The first two timepoints ('17 school year) and the final two timepoints ('18 school year)
correspond to each year's standardized test.
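To make the two options concrete, here is a small sketch of how the folds could be laid out for four ordered timepoints. It only builds the train/test index sets and fits no model. Reading "leave one out" as holding out one timepoint at a time is an assumption on my part (it could equally mean leaving out one participant), and the function names and the 1..4 coding are hypothetical.

```python
def forward_chaining_splits(timepoints):
    """Time-series style folds: train on all earlier timepoints, test on the next."""
    ordered = sorted(timepoints)
    return [(ordered[:i], [ordered[i]]) for i in range(1, len(ordered))]


def leave_one_out_splits(timepoints):
    """Hold out each timepoint once and train on the remaining ones."""
    ordered = sorted(timepoints)
    return [([t for t in ordered if t != held], [held]) for held in ordered]


# Four timepoints: 1 and 2 in the '17 school year, 3 and 4 in the '18 school year.
timepoints = [1, 2, 3, 4]

forward_folds = forward_chaining_splits(timepoints)
loo_folds = leave_one_out_splits(timepoints)

for train, test in forward_folds:
    print("forward-chaining  train:", train, "test:", test)
for train, test in loo_folds:
    print("leave-one-out     train:", train, "test:", test)
```

Forward chaining never trains on the future but leaves only three folds with very short training histories; leave-one-timepoint-out gives four folds but lets later timepoints inform predictions of earlier ones.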
Difficulty of selecting among multilevel models using predictive accuracy<http://www.stat.columbia.edu/~gelman/research/published/final_sub.pdf>
Wei Wang and Andrew Gelman, Statistics and Its Interface, Volume 7 (2014)
On the use of cross-validation for time series predictor evaluation | Information Sciences: an International Journal<https://dl.acm.org/doi/10.1016/j.ins.2011.12.028>
In time series predictor evaluation, we observe that with respect to the model selection procedure there is a gap between evaluation of traditional forecasting procedures, on the one hand, and evaluation of machine learning techniques on the other hand.
Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure - Roberts - 2017 - Ecography - Wiley Online Library<https://onlinelibrary.wiley.com/doi/full/10.1111/ecog.02881>
Ideally, model validation, selection, and predictive errors should be calculated using independent data (Araújo et al. 2005). For example, validation may be undertaken with data from different geographic regions or spatially distinct subsets of the region, different time periods, such as historic species records from the recent past or from fossil records.
From: Maarten Jung <Maarten.Jung using mailbox.tu-dresden.de>
Sent: Monday, February 17, 2020 1:35 AM
To: Ades, James <jades using health.ucsd.edu>
Cc: r-sig-mixed-models using r-project.org <r-sig-mixed-models using r-project.org>
Subject: Re: [R-sig-ME] Most principled reporting of mixed-effect model regression coefficients
> Thanks, Maarten. So I was planning on reporting R^2 (along with AIC) for the overall model fit, not for each predictor, since the regression coefficients themselves give a good indication of relationship (though I wasn't aware that R^2 is "riddled with complications"). Is Henrik saying this only with regard to LMMs and GLMMs?
That makes sense to me. For the overall model fit I would probably
still go with Johnson's version, which I describe in my
StackExchange post (and I think you mentioned it, or the Nakagawa and
Schielzeth version it is based on, earlier) and report both the
marginal and conditional R^2 values. The regression coefficients
provide unstandardized effect sizes on the response scale which I
think are a valid way to report effect sizes (see below).
I think Henrik refers to (G)LMMs and gives Rights & Sterba (2019)
as a reference. Also, the GLMM FAQ website provides a good overview.
> When you say "there is no agreed upon way to calculate effect sizes" I'm a little confused. I read through your stack exchange posting, but Henrik's answer refers to standardized effect size. You write, later down, "Whenever possible, we report unstandardized effect sizes which is in line with general recommendation of how to report effect sizes"
What you cite is still Henrik's opinion (and I hoped that I could make
this clear by writing "This is what he suggests [...]" and by using
the <blockquote> on StackExchange). And your citation still refers to
LMMs as he says "Unfortunately, due to the way that variance is
partitioned in linear mixed models (e.g., Rights & Sterba, 2019),
there does not exist an agreed upon way to calculate standard effect
sizes for individual model terms such as main effects or
In general, I agree with him and with his recommendation to report
unstandardized effect sizes (e.g. regression coefficients) if they
have a "meaningful" interpretation.
The semi-partial R^2 I mentioned in my last e-mail is an
additional/alternative indicator of effect sizes that is probably more
in line with what psychologists are used to seeing reported in papers
(especially when results of factorial designs are reported) - and
that's the reason I mentioned it.
> I'm also working on a systematic review where there's disagreement over whether effect sizes should be standardized, but it does seem that, to yield any kind of meaningful comparison, effect sizes would have to be standardized. I don't usually report standardized effect sizes...however, there are times when I z-score IVs to put them on the same scale, and I guess the output of that would be a standardized effect size. I wasn't aware of push back on that practice. What issues would arise from this?
There is nothing wrong with standardizing (e.g. by dividing by 1 or 2
standard deviations) predictor variables to get measures of variable
importance (within the same model).
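As a small illustration of the rescaling mentioned above, the sketch below divides a predictor by 1 or 2 sample standard deviations. Centering on the mean first is my addition (it does not change the slope, only the intercept), and the function name and the data are made up. Dividing by 2 standard deviations is sometimes preferred because the resulting coefficient is roughly comparable to that of an untransformed binary predictor.

```python
import statistics


def standardize(values, n_sd=1):
    """Center a predictor and divide it by n_sd sample standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return [(v - mean) / (n_sd * sd) for v in values]


# Hypothetical raw predictor values.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z_one_sd = standardize(raw, n_sd=1)   # classic z-scores
z_two_sd = standardize(raw, n_sd=2)   # scaled by 2 SDs

print("SD after dividing by 1 SD:", statistics.stdev(z_one_sd))
print("SD after dividing by 2 SD:", statistics.stdev(z_two_sd))
```

A coefficient on the rescaled predictor is then the expected change in the response per 1 (or 2) standard deviations of that predictor, which allows predictors within the same model to be compared on a common scale.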
Issues arise when standardized effect sizes such as R^2, partial
eta^2, etc. between different models are compared without thinking
about what differences in these measures can be attributed to (see
e.g. this question or the Pek & Flora (2018) paper that Henrik
cites). Note that these are general issues that apply to all
regression models, not only mixed models.