[R-sig-ME] Model diagnostics show slope in residuals plot and slope on the observed vs fitted plot is different than y = x

Mon Oct 3 14:02:12 CEST 2016

Dear Carlos,

I concur with Paul. After play a bit with the data you send me privately, I
see a few things which cause problems:
1) the number of measurements per ID is low. 1/3 has one measurement in
each level of C, 1/3 in two out of three levels of C and 1/3 in only one
level of C.
2) the variance of ID is larger than the residual variance
3) the effect of W is small compared to the variance of ID

If possible try to add extra covariates. If not I'd fall back on a simple
lm. Either with ignoring the repeated measurements or by sampling the data
so you have only one observation per ID.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2016-10-03 11:17 GMT+02:00 Paul Johnson <paul.johnson op glasgow.ac.uk>:

> Hi Carlos,
>
> Your plot didn’t come through, as Thierry noted. However it’s expected
> that, unlike a standard linear regression model, an LMM with a nested
> structure such as yours will give a positive linear relationship between
> the residuals and the fitted values (might also be true for other
> structures?), provided the residual and random effect variances are > 0.
> Somebody will hopefully chip in with a formal explanation, but it’s
> basically a similar phenomenon to regression to the mean.
>
> Imagine a group of students taking each taking an aptitude test three
> times. There are two random factors: the difference in underlying aptitude
> between the students, modelled by the student ID random effect; and random
> variation between time points within each student (e.g. how good a
> particular student is feeling on a particular day). I’m ignoring variation
> between tests — let’s unrealistically assume they’s all the same and
> students completely forget about them between tests.
>
> The students with the best mean scores will be a mixture of excellent
> students having three so-so (some good, some bad) days, and moderately good
> students having the good luck to have three good days, and the very best
> scores will come from students who were both excellent and lucky (although
> this category will be small). An important point is that there is no way of
> using the data to separate the moderate-student-lucky-days high scores from
> the excellent-student-average-days scores. If we simply took the mean of
> the scores, we would be overestimating the performance of the students on
> average (we’d have good estimates of the excellent students and
> overestimates of the moderate ones), so the best estimate is achieved by
> shrinking the scores towards the mean.
>
> This is what happens when the model is fitted. Each student is given a
> residual (random effect) at the student level (how good the student is
> relative to the value predicted by the fixed effects) and three residuals
> at the observation (between-test-within-student) level. For students with
> good scores, this will be a compromise between the inseparable
> excellent-student-average-days scenario and the moderate-student-lucky-days
> scenario. As a result, students with high student-level residuals (the
> student random effects) will also tend to have high inter-test residuals.
> The same is also true in negative for poor students and students having
> three bad days. So the student random effects (which are part of the fitted
> values) and the residuals will be positively correlated.
>
> You can check this using by simulating new data from the fitted model
> re-fitting the model, and comparing the residuals-x-fitted plot (which will
> be "perfect”) to the one from your data. Here’s a function that does this
> for lme4 fits:
>
> devtools::install_github("pcdjohnson/GLMMmisc")
> library(GLMMmisc)
> sim.residplot(fit) # repeat this a few times to account for sampling error
>
> If all is well, you should see a similar slope between the real and the
> simulated plots, in fact the general pattern of the residuals should be
> similar.
>
> (The new package DHARMa — https://theoreticalecology.
> wordpress.com/2016/08/28/dharma-an-r-package-for-
> residual-diagnostics-of-glmms/ — takes a similar approach to assessing
> residuals, but in a less quick-and-dirty, more formally justified way.)
>
> All the best,
> Paul
>
>
>
>
> > On 2 Oct 2016, at 16:57, Carlos Familia <carlosfamilia op gmail.com> wrote:
> >
> > Hello,
> >
> > I have in hands a quite large and unbalanced dataset, for which a Y
> continuous dependent variable was measured in 3 different conditions (C)
> for about 3000 subjects (ID) (although, not all subjects have Y values for
> the 3 conditions). Additionally, there is continuous measure W which was
> measured for all subjects.
> >
> > I am interested in testing the following:
> >
> > - Is the effect of W significant overall
> > - Is the effect of W significant at each level of C
> > - Is the effect of C significant
> >
> > In order to try to answer this, I have specified the following model
> with lmer:
> >
> > lmer( Y ~ W * C + (1 | ID), data = df)
> >
> > Which seems to proper reflect the structure of the data (I might be
> wrong here, any suggestions would be welcome).
> > However when running the diagnostic plots I noticed a slope in the
> residuals plot and a slope different than y = x for the observed vs fitted
> plot (as shown bellow). Which made me question the validity of the model
> for inference.
> >
> > Could I still use this model for inference? Should I specify a different
> formula? Should I turn to lme and try to include different variances for
> each level of conditions (C)? Any ideas?
> >
> > I would be really appreciated if anyone could help me with this.
> >
> > Thanks in advance,
> > Carlos Família
> >
> > _______________________________________________
> > R-sig-mixed-models op r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
> _______________________________________________
> R-sig-mixed-models op r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>

	[[alternative HTML version deleted]]