[R-sig-ME] Model diagnostics show slope in residuals plot and slope on the observed vs fitted plot is different than y = x

Mon Oct 3 11:17:59 CEST 2016

Hi Carlos,

Your plot didn’t come through, as Thierry noted. However it’s expected that, unlike a standard linear regression model, an LMM with a nested structure such as yours will give a positive linear relationship between the residuals and the fitted values (might also be true for other structures?), provided the residual and random effect variances are > 0. Somebody will hopefully chip in with a formal explanation, but it’s basically a similar phenomenon to regression to the mean. 

Imagine a group of students taking each taking an aptitude test three times. There are two random factors: the difference in underlying aptitude between the students, modelled by the student ID random effect; and random variation between time points within each student (e.g. how good a particular student is feeling on a particular day). I’m ignoring variation between tests — let’s unrealistically assume they’s all the same and students completely forget about them between tests. 

The students with the best mean scores will be a mixture of excellent students having three so-so (some good, some bad) days, and moderately good students having the good luck to have three good days, and the very best scores will come from students who were both excellent and lucky (although this category will be small). An important point is that there is no way of using the data to separate the moderate-student-lucky-days high scores from the excellent-student-average-days scores. If we simply took the mean of the scores, we would be overestimating the performance of the students on average (we’d have good estimates of the excellent students and overestimates of the moderate ones), so the best estimate is achieved by shrinking the scores towards the mean. 

This is what happens when the model is fitted. Each student is given a residual (random effect) at the student level (how good the student is relative to the value predicted by the fixed effects) and three residuals at the observation (between-test-within-student) level. For students with good scores, this will be a compromise between the inseparable excellent-student-average-days scenario and the moderate-student-lucky-days scenario. As a result, students with high student-level residuals (the student random effects) will also tend to have high inter-test residuals. The same is also true in negative for poor students and students having three bad days. So the student random effects (which are part of the fitted values) and the residuals will be positively correlated. 

You can check this using by simulating new data from the fitted model re-fitting the model, and comparing the residuals-x-fitted plot (which will be "perfect”) to the one from your data. Here’s a function that does this for lme4 fits:

devtools::install_github("pcdjohnson/GLMMmisc")
library(GLMMmisc)
sim.residplot(fit) # repeat this a few times to account for sampling error

If all is well, you should see a similar slope between the real and the simulated plots, in fact the general pattern of the residuals should be similar.

(The new package DHARMa — https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-package-for-residual-diagnostics-of-glmms/ — takes a similar approach to assessing residuals, but in a less quick-and-dirty, more formally justified way.)

All the best,
Paul

> On 2 Oct 2016, at 16:57, Carlos Familia <carlosfamilia at gmail.com> wrote:
> 
> Hello,
> 
> I have in hands a quite large and unbalanced dataset, for which a Y continuous dependent variable was measured in 3 different conditions (C) for about 3000 subjects (ID) (although, not all subjects have Y values for the 3 conditions). Additionally, there is continuous measure W which was measured for all subjects.
> 
> I am interested in testing the following: 
> 
> - Is the effect of W significant overall
> - Is the effect of W significant at each level of C
> - Is the effect of C significant
> 
> In order to try to answer this, I have specified the following model with lmer:
> 
> lmer( Y ~ W * C + (1 | ID), data = df)
> 
> Which seems to proper reflect the structure of the data (I might be wrong here, any suggestions would be welcome).
> However when running the diagnostic plots I noticed a slope in the residuals plot and a slope different than y = x for the observed vs fitted plot (as shown bellow). Which made me question the validity of the model for inference.
> 
> Could I still use this model for inference? Should I specify a different formula? Should I turn to lme and try to include different variances for each level of conditions (C)? Any ideas?
> 
> I would be really appreciated if anyone could help me with this.
> 
> Thanks in advance,
> Carlos Família
> 
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models