[R-sig-ME] Implications of modeling residuals in multilevel models

Fri Dec 22 19:30:01 CET 2023

   This slipped through the cracks.

On 2023-12-22 1:10 p.m., Simon Harmel wrote:
> Hello All,
> 
> Just a follow-up, can we say the model I sketched above is a location-scale
> model?
> 
> Thanks,
> Simon
> 
> On Sat, Dec 16, 2023 at 9:30 PM Simon Harmel <sim.harmel using gmail.com> wrote:
> 
>> Hello All,
>>
>> I have a highly skewed dataset.

   As may have been said before, the *marginal* distribution of your 
data set (e.g., what you get if you plot hist(y)) is irrelevant/doesn't 
matter at all (the models make no assumptions about the marginal 
distribution)

  But, my MODEL of choice below shows
>> drastically improved, normally distributed residuals (and predicted values)
>> compared to other models whose residuals are not modeled.
>>
>> Three quick questions:
>>
>> 1- Is MODEL below assuming that my data come from a population that looks
>> normal once "X1_categorical" and "X2_numeric" are taken into account
>> as modeled in MODEL?

   Yes, the *conditional distribution* of y is Gaussian (but not IID 
because you have specified correlation and variance (weights) structures).

>> 2- Do these normally distributed predictions work better for a subject
>> randomly drawn from a similarly skewed population with a
>> known "X1_categorical" and "X2_numeric"?

   Don't think I understand what this means.

>> 3- I think the distribution of residuals mirrors that of the data. If so,
>> is it correct to say MODEL below is actually **trimming** my highly skewed
>> data as if it was distributed as:

   That seems an odd way to put it.  I would say that, if the residuals 
look Normally distributed, that means that the covariates in the model 
are accounting for the skew.

Here's an example of a dataset where the marginal distribution is 
heavily skewed but the conditional distribution is (exactly) Normal:

set.seed(101)
x <- rgamma(1000, shape = 1)
y <- rnorm(1000, mean = x, sd = 0.1)
hist(y)

qqnorm(residuals(lm(y~x)))

   Heteroscedasticity (but not autocorrelation) will also change the 
distribution of the residuals, but I don't think it's necessary to 
illustrate the point here.

   I think it would be fair to call this a location-scale model since 
you are modeling both the location (mean) and the scale (SD/variance), 
although often people use the term for models where there is a random 
effect in both the location and the scale model.

>> ```
>> hist(resid(MODEL, type = "normalized"))
>> ```
>>
>>   MODEL <- nlme::lme(y ~ X1_categorical + X2_numeric ...,
>>           random = ~1| subject,
>>           data = data,
>>           correlation = corSymm(~1|subject),
>>           weights = varComb(varIdent(form = ~ 1 |  X1_categorical ),
>>                                            varPower(form = ~  X2_numeric )))
>>
>> Thanks,
>> Simon
>>
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> R-sig-mixed-models using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

-- 
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
 > E-mail is sent at my convenience; I don't expect replies outside of 
working hours.