[R-sig-ME] Implications of modeling residuals in multilevel models

Thu Jan 4 09:05:58 CET 2024

>>>>> Ben Bolker 
>>>>>     on Fri, 22 Dec 2023 17:07:13 -0500 writes:

    > On 2023-12-22 4:50 p.m., Simon Harmel wrote:
    >> Dear Ben,
    >> 
    >> These are exactly what I was looking for! Thank you so much. Please 
    >> allow me to clarify my 2nd question.
    >> 
    >> When I compare predict(MODEL) with predict(model_with_no_resid_modeled), 
    >> the former looks quite normal and unimodal in shape but the latter looks 
    >> multi-model (see the illustration below).
    >> 
    >> This made me think: should a good-fitting model produce normally 
    >> distributed predicted values? And does such a model do a better job of 
    >> predicting the value of random draw from the y population?

    > Ideally the *marginal* (overall) distribution of *predictions* 
    > should match the marginal distribution of the response variable (this is 
    > what the 'posterior predictive check' panel of 
    > performance::check_model() does).  

Well, only approximately and only if you have enough degrees of freedom.
Remember the simple (homo schedastic) LS case where

    Cov(Y^) = \sigma^2 * (I - P)

P = H is the projection / hat matrix  X' (X'X)^{-1} X
So, the variances of Y^_i are *smaller* than those of the Y_i

    > If the marginal distribution of your 
    > data is skewed, then the marginal distribution of your prediction should 
    > be skewed in the same way -- if it's not, then it's not doing a good job 
    > describing the data.

    > m <- lm(mpg ~ cyl + disp, mtcars)
    > performance::check_model(m, check = "pp_check")

    > This is a different story from the *residuals* (which tell you 
    > something about the conditional distribution, not the marginal 
    > distribution ...)

    >>                *
    >>             * * *
    >>        * * * * * * *
    >>    * * * * * * * * * *
    >> * * * * * * * * * * * *
    >> -------------------------  vs:
    >>                         *
    >>                       * *
    >>       *             * * *
    >>    * *   * *     * * * *
    >> * * * * * * * * * * * *
    >> -------------------------
    >> 
    >> On Fri, Dec 22, 2023 at 12:30 PM Ben Bolker <bbolker using gmail.com 
    >> <mailto:bbolker using gmail.com>> wrote:
    >> 
    >>    This slipped through the cracks.
    >> 
    >> On 2023-12-22 1:10 p.m., Simon Harmel wrote:
    >> > Hello All,
    >> >
    >> > Just a follow-up, can we say the model I sketched above is a
    >> location-scale
    >> > model?
    >> >
    >> > Thanks,
    >> > Simon
    >> >
    >> > On Sat, Dec 16, 2023 at 9:30 PM Simon Harmel
    >> <sim.harmel using gmail.com <mailto:sim.harmel using gmail.com>> wrote:
    >> >
    >> >> Hello All,
    >> >>
    >> >> I have a highly skewed dataset.
    >> 
    >>    As may have been said before, the *marginal* distribution of your
    >> data set (e.g., what you get if you plot hist(y)) is irrelevant/doesn't
    >> matter at all (the models make no assumptions about the marginal
    >> distribution)
    >> 
    >>   But, my MODEL of choice below shows
    >> >> drastically improved, normally distributed residuals (and
    >> predicted values)
    >> >> compared to other models whose residuals are not modeled.
    >> >>
    >> >> Three quick questions:
    >> >>
    >> >> 1- Is MODEL below assuming that my data come from a population
    >> that looks
    >> >> normal once "X1_categorical" and "X2_numeric" are taken into account
    >> >> as modeled in MODEL?
    >> 
    >>    Yes, the *conditional distribution* of y is Gaussian (but not IID
    >> because you have specified correlation and variance (weights)
    >> structures).
    >> 
    >> 
    >> >> 2- Do these normally distributed predictions work better for a
    >> subject
    >> >> randomly drawn from a similarly skewed population with a
    >> >> known "X1_categorical" and "X2_numeric"?
    >> 
    >>    Don't think I understand what this means.
    >> 
    >> >> 3- I think the distribution of residuals mirrors that of the
    >> data. If so,
    >> >> is it correct to say MODEL below is actually **trimming** my
    >> highly skewed
    >> >> data as if it was distributed as:
    >> 
    >>    That seems an odd way to put it.  I would say that, if the
    >> residuals
    >> look Normally distributed, that means that the covariates in the model
    >> are accounting for the skew.
    >> 
    >> Here's an example of a dataset where the marginal distribution is
    >> heavily skewed but the conditional distribution is (exactly) Normal:
    >> 
    >> set.seed(101)
    >> x <- rgamma(1000, shape = 1)
    >> y <- rnorm(1000, mean = x, sd = 0.1)
    >> hist(y)
    >> 
    >> qqnorm(residuals(lm(y~x)))
    >> 
    >>    Heteroscedasticity (but not autocorrelation) will also change the
    >> distribution of the residuals, but I don't think it's necessary to
    >> illustrate the point here.
    >> 
    >>    I think it would be fair to call this a location-scale model since
    >> you are modeling both the location (mean) and the scale (SD/variance),
    >> although often people use the term for models where there is a random
    >> effect in both the location and the scale model.
    >> 
    >> 
    >> 
    >> 
    >> 
    >> >> ```
    >> >> hist(resid(MODEL, type = "normalized"))
    >> >> ```
    >> >>
    >> >>   MODEL <- nlme::lme(y ~ X1_categorical + X2_numeric ...,
    >> >>           random = ~1| subject,
    >> >>           data = data,
    >> >>           correlation = corSymm(~1|subject),
    >> >>           weights = varComb(varIdent(form = ~ 1 |  X1_categorical ),
    >> >>                                            varPower(form = ~ 
    >> X2_numeric )))
    >> >>
    >> >> Thanks,
    >> >> Simon
    >> >>
    >> >
    >> >       [[alternative HTML version deleted]]
    >> >
    >> > _______________________________________________
    >> > R-sig-mixed-models using r-project.org
    >> <mailto:R-sig-mixed-models using r-project.org> mailing list
    >> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
    >> <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
    >> 
    >> -- 
    >> Dr. Benjamin Bolker
    >> Professor, Mathematics & Statistics and Biology, McMaster University
    >> Director, School of Computational Science and Engineering
    >> (Acting) Graduate chair, Mathematics & Statistics
    >>  > E-mail is sent at my convenience; I don't expect replies outside of
    >> working hours.
    >> 
    >> _______________________________________________
    >> R-sig-mixed-models using r-project.org
    >> <mailto:R-sig-mixed-models using r-project.org> mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
    >> <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
    >> 

    > -- 
    > Dr. Benjamin Bolker
    > Professor, Mathematics & Statistics and Biology, McMaster University
    > Director, School of Computational Science and Engineering
    > (Acting) Graduate chair, Mathematics & Statistics
    >> E-mail is sent at my convenience; I don't expect replies outside of 
    > working hours.

    > _______________________________________________
    > R-sig-mixed-models using r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models