[R-sig-ME] Implications of modeling residuals in multilevel models
Martin Maechler
m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Thu Jan 4 09:05:58 CET 2024
>>>>> Ben Bolker
>>>>> on Fri, 22 Dec 2023 17:07:13 -0500 writes:
> On 2023-12-22 4:50 p.m., Simon Harmel wrote:
>> Dear Ben,
>>
>> These are exactly what I was looking for! Thank you so much. Please
>> allow me to clarify my 2nd question.
>>
>> When I compare predict(MODEL) with predict(model_with_no_resid_modeled),
>> the former looks quite normal and unimodal in shape but the latter looks
>> multi-model (see the illustration below).
>>
>> This made me think: should a good-fitting model produce normally
>> distributed predicted values? And does such a model do a better job of
>> predicting the value of random draw from the y population?
> Ideally the *marginal* (overall) distribution of *predictions*
> should match the marginal distribution of the response variable (this is
> what the 'posterior predictive check' panel of
> performance::check_model() does).
Well, only approximately and only if you have enough degrees of freedom.
Remember the simple (homo schedastic) LS case where
Cov(Y^) = \sigma^2 * (I - P)
P = H is the projection / hat matrix X' (X'X)^{-1} X
So, the variances of Y^_i are *smaller* than those of the Y_i
> If the marginal distribution of your
> data is skewed, then the marginal distribution of your prediction should
> be skewed in the same way -- if it's not, then it's not doing a good job
> describing the data.
> m <- lm(mpg ~ cyl + disp, mtcars)
> performance::check_model(m, check = "pp_check")
> This is a different story from the *residuals* (which tell you
> something about the conditional distribution, not the marginal
> distribution ...)
>> *
>> * * *
>> * * * * * * *
>> * * * * * * * * * *
>> * * * * * * * * * * * *
>> ------------------------- vs:
>> *
>> * *
>> * * * *
>> * * * * * * * *
>> * * * * * * * * * * * *
>> -------------------------
>>
>> On Fri, Dec 22, 2023 at 12:30 PM Ben Bolker <bbolker using gmail.com
>> <mailto:bbolker using gmail.com>> wrote:
>>
>> This slipped through the cracks.
>>
>> On 2023-12-22 1:10 p.m., Simon Harmel wrote:
>> > Hello All,
>> >
>> > Just a follow-up, can we say the model I sketched above is a
>> location-scale
>> > model?
>> >
>> > Thanks,
>> > Simon
>> >
>> > On Sat, Dec 16, 2023 at 9:30 PM Simon Harmel
>> <sim.harmel using gmail.com <mailto:sim.harmel using gmail.com>> wrote:
>> >
>> >> Hello All,
>> >>
>> >> I have a highly skewed dataset.
>>
>> As may have been said before, the *marginal* distribution of your
>> data set (e.g., what you get if you plot hist(y)) is irrelevant/doesn't
>> matter at all (the models make no assumptions about the marginal
>> distribution)
>>
>> But, my MODEL of choice below shows
>> >> drastically improved, normally distributed residuals (and
>> predicted values)
>> >> compared to other models whose residuals are not modeled.
>> >>
>> >> Three quick questions:
>> >>
>> >> 1- Is MODEL below assuming that my data come from a population
>> that looks
>> >> normal once "X1_categorical" and "X2_numeric" are taken into account
>> >> as modeled in MODEL?
>>
>> Yes, the *conditional distribution* of y is Gaussian (but not IID
>> because you have specified correlation and variance (weights)
>> structures).
>>
>>
>> >> 2- Do these normally distributed predictions work better for a
>> subject
>> >> randomly drawn from a similarly skewed population with a
>> >> known "X1_categorical" and "X2_numeric"?
>>
>> Don't think I understand what this means.
>>
>> >> 3- I think the distribution of residuals mirrors that of the
>> data. If so,
>> >> is it correct to say MODEL below is actually **trimming** my
>> highly skewed
>> >> data as if it was distributed as:
>>
>> That seems an odd way to put it. I would say that, if the
>> residuals
>> look Normally distributed, that means that the covariates in the model
>> are accounting for the skew.
>>
>> Here's an example of a dataset where the marginal distribution is
>> heavily skewed but the conditional distribution is (exactly) Normal:
>>
>> set.seed(101)
>> x <- rgamma(1000, shape = 1)
>> y <- rnorm(1000, mean = x, sd = 0.1)
>> hist(y)
>>
>> qqnorm(residuals(lm(y~x)))
>>
>> Heteroscedasticity (but not autocorrelation) will also change the
>> distribution of the residuals, but I don't think it's necessary to
>> illustrate the point here.
>>
>> I think it would be fair to call this a location-scale model since
>> you are modeling both the location (mean) and the scale (SD/variance),
>> although often people use the term for models where there is a random
>> effect in both the location and the scale model.
>>
>>
>>
>>
>>
>> >> ```
>> >> hist(resid(MODEL, type = "normalized"))
>> >> ```
>> >>
>> >> MODEL <- nlme::lme(y ~ X1_categorical + X2_numeric ...,
>> >> random = ~1| subject,
>> >> data = data,
>> >> correlation = corSymm(~1|subject),
>> >> weights = varComb(varIdent(form = ~ 1 | X1_categorical ),
>> >> varPower(form = ~
>> X2_numeric )))
>> >>
>> >> Thanks,
>> >> Simon
>> >>
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > R-sig-mixed-models using r-project.org
>> <mailto:R-sig-mixed-models using r-project.org> mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>> <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
>>
>> --
>> Dr. Benjamin Bolker
>> Professor, Mathematics & Statistics and Biology, McMaster University
>> Director, School of Computational Science and Engineering
>> (Acting) Graduate chair, Mathematics & Statistics
>> > E-mail is sent at my convenience; I don't expect replies outside of
>> working hours.
>>
>> _______________________________________________
>> R-sig-mixed-models using r-project.org
>> <mailto:R-sig-mixed-models using r-project.org> mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>> <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
>>
> --
> Dr. Benjamin Bolker
> Professor, Mathematics & Statistics and Biology, McMaster University
> Director, School of Computational Science and Engineering
> (Acting) Graduate chair, Mathematics & Statistics
>> E-mail is sent at my convenience; I don't expect replies outside of
> working hours.
> _______________________________________________
> R-sig-mixed-models using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
More information about the R-sig-mixed-models
mailing list