[R-sig-ME] Implications of modeling residuals in multilevel models
Ben Bolker
bbo|ker @end|ng |rom gm@||@com
Fri Dec 22 23:07:13 CET 2023
On 2023-12-22 4:50 p.m., Simon Harmel wrote:
> Dear Ben,
>
> These are exactly what I was looking for! Thank you so much. Please
> allow me to clarify my 2nd question.
>
> When I compare predict(MODEL) with predict(model_with_no_resid_modeled),
> the former looks quite normal and unimodal in shape but the latter looks
> multi-model (see the illustration below).
>
> This made me think: should a good-fitting model produce normally
> distributed predicted values? And does such a model do a better job of
> predicting the value of random draw from the y population?
Ideally the *marginal* (overall) distribution of *predictions*
should match the marginal distribution of the response variable (this is
what the 'posterior predictive check' panel of
performance::check_model() does). If the marginal distribution of your
data is skewed, then the marginal distribution of your prediction should
be skewed in the same way -- if it's not, then it's not doing a good job
describing the data.
m <- lm(mpg ~ cyl + disp, mtcars)
performance::check_model(m, check = "pp_check")
This is a different story from the *residuals* (which tell you
something about the conditional distribution, not the marginal
distribution ...)
> *
> * * *
> * * * * * * *
> * * * * * * * * * *
> * * * * * * * * * * * *
> ------------------------- vs:
> *
> * *
> * * * *
> * * * * * * * *
> * * * * * * * * * * * *
> -------------------------
>
> On Fri, Dec 22, 2023 at 12:30 PM Ben Bolker <bbolker using gmail.com
> <mailto:bbolker using gmail.com>> wrote:
>
> This slipped through the cracks.
>
> On 2023-12-22 1:10 p.m., Simon Harmel wrote:
> > Hello All,
> >
> > Just a follow-up, can we say the model I sketched above is a
> location-scale
> > model?
> >
> > Thanks,
> > Simon
> >
> > On Sat, Dec 16, 2023 at 9:30 PM Simon Harmel
> <sim.harmel using gmail.com <mailto:sim.harmel using gmail.com>> wrote:
> >
> >> Hello All,
> >>
> >> I have a highly skewed dataset.
>
> As may have been said before, the *marginal* distribution of your
> data set (e.g., what you get if you plot hist(y)) is irrelevant/doesn't
> matter at all (the models make no assumptions about the marginal
> distribution)
>
> But, my MODEL of choice below shows
> >> drastically improved, normally distributed residuals (and
> predicted values)
> >> compared to other models whose residuals are not modeled.
> >>
> >> Three quick questions:
> >>
> >> 1- Is MODEL below assuming that my data come from a population
> that looks
> >> normal once "X1_categorical" and "X2_numeric" are taken into account
> >> as modeled in MODEL?
>
> Yes, the *conditional distribution* of y is Gaussian (but not IID
> because you have specified correlation and variance (weights)
> structures).
>
>
> >> 2- Do these normally distributed predictions work better for a
> subject
> >> randomly drawn from a similarly skewed population with a
> >> known "X1_categorical" and "X2_numeric"?
>
> Don't think I understand what this means.
>
> >> 3- I think the distribution of residuals mirrors that of the
> data. If so,
> >> is it correct to say MODEL below is actually **trimming** my
> highly skewed
> >> data as if it was distributed as:
>
> That seems an odd way to put it. I would say that, if the
> residuals
> look Normally distributed, that means that the covariates in the model
> are accounting for the skew.
>
> Here's an example of a dataset where the marginal distribution is
> heavily skewed but the conditional distribution is (exactly) Normal:
>
> set.seed(101)
> x <- rgamma(1000, shape = 1)
> y <- rnorm(1000, mean = x, sd = 0.1)
> hist(y)
>
> qqnorm(residuals(lm(y~x)))
>
> Heteroscedasticity (but not autocorrelation) will also change the
> distribution of the residuals, but I don't think it's necessary to
> illustrate the point here.
>
> I think it would be fair to call this a location-scale model since
> you are modeling both the location (mean) and the scale (SD/variance),
> although often people use the term for models where there is a random
> effect in both the location and the scale model.
>
>
>
>
>
> >> ```
> >> hist(resid(MODEL, type = "normalized"))
> >> ```
> >>
> >> MODEL <- nlme::lme(y ~ X1_categorical + X2_numeric ...,
> >> random = ~1| subject,
> >> data = data,
> >> correlation = corSymm(~1|subject),
> >> weights = varComb(varIdent(form = ~ 1 | X1_categorical ),
> >> varPower(form = ~
> X2_numeric )))
> >>
> >> Thanks,
> >> Simon
> >>
> >
> > [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-mixed-models using r-project.org
> <mailto:R-sig-mixed-models using r-project.org> mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
>
> --
> Dr. Benjamin Bolker
> Professor, Mathematics & Statistics and Biology, McMaster University
> Director, School of Computational Science and Engineering
> (Acting) Graduate chair, Mathematics & Statistics
> > E-mail is sent at my convenience; I don't expect replies outside of
> working hours.
>
> _______________________________________________
> R-sig-mixed-models using r-project.org
> <mailto:R-sig-mixed-models using r-project.org> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
>
--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
> E-mail is sent at my convenience; I don't expect replies outside of
working hours.
More information about the R-sig-mixed-models
mailing list