[R-sig-ME] Model diagnostics show slope in residuals plot and slope on the observed vs fitted plot is different than y = x
thierry.onkelinx at inbo.be
Mon Oct 3 14:10:05 CEST 2016
In case you have more than one measurement per ID, select one of them at
random. Something like
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey
2016-10-03 14:06 GMT+02:00 Carlos Familia <carlosfamilia op gmail.com>:
> Dear Thierry,
> When you say sampling the data to have only one observation per ID, you
> mean reducing the dataset to cases where the Y variable was only measured
> in 1 condition?
> I have though about going lm with this, but it just didn’t feel wright...
> Many thanks,
> On 3 Oct 2016, at 13:02, Thierry Onkelinx <thierry.onkelinx op inbo.be>
> Dear Carlos,
> I concur with Paul. After play a bit with the data you send me privately,
> I see a few things which cause problems:
> 1) the number of measurements per ID is low. 1/3 has one measurement in
> each level of C, 1/3 in two out of three levels of C and 1/3 in only one
> level of C.
> 2) the variance of ID is larger than the residual variance
> 3) the effect of W is small compared to the variance of ID
> If possible try to add extra covariates. If not I'd fall back on a simple
> lm. Either with ignoring the repeated measurements or by sampling the data
> so you have only one observation per ID.
> Best regards,
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
> 2016-10-03 11:17 GMT+02:00 Paul Johnson <paul.johnson op glasgow.ac.uk>:
>> Hi Carlos,
>> Your plot didn’t come through, as Thierry noted. However it’s expected
>> that, unlike a standard linear regression model, an LMM with a nested
>> structure such as yours will give a positive linear relationship between
>> the residuals and the fitted values (might also be true for other
>> structures?), provided the residual and random effect variances are > 0.
>> Somebody will hopefully chip in with a formal explanation, but it’s
>> basically a similar phenomenon to regression to the mean.
>> Imagine a group of students taking each taking an aptitude test three
>> times. There are two random factors: the difference in underlying aptitude
>> between the students, modelled by the student ID random effect; and random
>> variation between time points within each student (e.g. how good a
>> particular student is feeling on a particular day). I’m ignoring variation
>> between tests — let’s unrealistically assume they’s all the same and
>> students completely forget about them between tests.
>> The students with the best mean scores will be a mixture of excellent
>> students having three so-so (some good, some bad) days, and moderately good
>> students having the good luck to have three good days, and the very best
>> scores will come from students who were both excellent and lucky (although
>> this category will be small). An important point is that there is no way of
>> using the data to separate the moderate-student-lucky-days high scores from
>> the excellent-student-average-days scores. If we simply took the mean of
>> the scores, we would be overestimating the performance of the students on
>> average (we’d have good estimates of the excellent students and
>> overestimates of the moderate ones), so the best estimate is achieved by
>> shrinking the scores towards the mean.
>> This is what happens when the model is fitted. Each student is given a
>> residual (random effect) at the student level (how good the student is
>> relative to the value predicted by the fixed effects) and three residuals
>> at the observation (between-test-within-student) level. For students with
>> good scores, this will be a compromise between the inseparable
>> excellent-student-average-days scenario and the moderate-student-lucky-days
>> scenario. As a result, students with high student-level residuals (the
>> student random effects) will also tend to have high inter-test residuals.
>> The same is also true in negative for poor students and students having
>> three bad days. So the student random effects (which are part of the fitted
>> values) and the residuals will be positively correlated.
>> You can check this using by simulating new data from the fitted model
>> re-fitting the model, and comparing the residuals-x-fitted plot (which will
>> be "perfect”) to the one from your data. Here’s a function that does this
>> for lme4 fits:
>> sim.residplot(fit) # repeat this a few times to account for sampling error
>> If all is well, you should see a similar slope between the real and the
>> simulated plots, in fact the general pattern of the residuals should be
>> (The new package DHARMa — https://theoreticalecology.wor
>> diagnostics-of-glmms/ — takes a similar approach to assessing residuals,
>> but in a less quick-and-dirty, more formally justified way.)
>> All the best,
>> > On 2 Oct 2016, at 16:57, Carlos Familia <carlosfamilia op gmail.com>
>> > Hello,
>> > I have in hands a quite large and unbalanced dataset, for which a Y
>> continuous dependent variable was measured in 3 different conditions (C)
>> for about 3000 subjects (ID) (although, not all subjects have Y values for
>> the 3 conditions). Additionally, there is continuous measure W which was
>> measured for all subjects.
>> > I am interested in testing the following:
>> > - Is the effect of W significant overall
>> > - Is the effect of W significant at each level of C
>> > - Is the effect of C significant
>> > In order to try to answer this, I have specified the following model
>> with lmer:
>> > lmer( Y ~ W * C + (1 | ID), data = df)
>> > Which seems to proper reflect the structure of the data (I might be
>> wrong here, any suggestions would be welcome).
>> > However when running the diagnostic plots I noticed a slope in the
>> residuals plot and a slope different than y = x for the observed vs fitted
>> plot (as shown bellow). Which made me question the validity of the model
>> for inference.
>> > Could I still use this model for inference? Should I specify a
>> different formula? Should I turn to lme and try to include different
>> variances for each level of conditions (C)? Any ideas?
>> > I would be really appreciated if anyone could help me with this.
>> > Thanks in advance,
>> > Carlos Família
>> > _______________________________________________
>> > R-sig-mixed-models op r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>> R-sig-mixed-models op r-project.org mailing list
[[alternative HTML version deleted]]
More information about the R-sig-mixed-models