[R-sig-ME] LMM diagnostics: conditional residuals correlated highly with fitted values

Yizhou Ma maxxx848 at umn.edu
Wed Oct 7 18:05:12 CEST 2015


Hi Thierry,

Thank you for clarifying. I agree that high skewness can lead to a
nonlinear relationship, which cannot be properly modeled in a linear
model.

I have plotted the residuals against all my fixed factors and I cannot
find any nonlinear relationship. It is possible that I am missing an
important covariate though.
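
A minimal sketch of that residual check, run on simulated data rather
than the real data set. The variable names mirror the model in this
thread, but every value below (family sizes, age range, effect sizes) is
made up for illustration. It plots the conditional residuals against each
covariate with a smoother, so curvature would show up as a bend in the line.

```r
library(lme4)

set.seed(1)
# Simulated stand-in for the real data: 189 families with 1-4 siblings each.
n_fam <- 189
fam_size <- sample(1:4, n_fam, replace = TRUE)
d <- data.frame(Family_ID = factor(rep(seq_len(n_fam), times = fam_size)))
n <- nrow(d)
d$Drink <- rnorm(n)
d$Age <- runif(n, 22, 36)                      # hypothetical age range
d$Gender <- factor(sample(c("F", "M"), n, replace = TRUE))
fam_eff <- rnorm(n_fam, sd = 0.6)              # hypothetical family intercepts
d$Y <- 1 + 0.15 * d$Drink - 0.03 * d$Age +
  fam_eff[as.integer(d$Family_ID)] + rnorm(n, sd = 0.8)

fit <- lmer(Y ~ Drink * Gender + Age + (1 | Family_ID), data = d, REML = FALSE)
d$res <- resid(fit)                            # conditional residuals

# One panel per covariate; a flat smoother suggests no missed nonlinearity.
op <- par(mfrow = c(1, 3))
scatter.smooth(d$Drink, d$res, xlab = "Drink", ylab = "conditional residual")
scatter.smooth(d$Age, d$res, xlab = "Age", ylab = "conditional residual")
boxplot(res ~ Gender, data = d, ylab = "conditional residual")
par(op)
```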

Thanks a lot,
Cherry
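
One thing that may be worth ruling out before hunting for a missing
covariate: with a random intercept and very small groups (here 379
observations in 189 families), the conditional fitted values contain
shrunken family intercepts, and the conditional residuals then tend to
correlate positively with those fitted values even when the model is
correctly specified. The sketch below simulates from a correctly
specified random-intercept model. The group size of two, the single
covariate X and the variance components are illustrative assumptions
loosely based on the summary output quoted further down, not the real data.

```r
library(lme4)

set.seed(42)
n_fam <- 189
fam <- factor(rep(seq_len(n_fam), each = 2))  # two siblings per family (assumption)
X <- rnorm(length(fam))
b <- rnorm(n_fam, sd = 0.6)                   # true family intercepts (assumption)
Y <- 0.2 * X + b[as.integer(fam)] + rnorm(length(fam), sd = 0.8)
sim <- data.frame(Y = Y, X = X, fam = fam)

fit <- lmer(Y ~ X + (1 | fam), data = sim, REML = FALSE)

# Conditional residuals vs. conditional fitted values (what plot(fit) shows).
r_cond <- cor(fitted(fit), resid(fit))
# Same residuals vs. population-level fitted values (fixed effects only).
r_pop <- cor(predict(fit, re.form = NA), resid(fit))

print(c(conditional = r_cond, population_level = r_pop))
plot(fitted(fit), resid(fit))
```

If the correlation in such a simulation is of similar size to the one in
the real fit, the pattern may simply reflect shrinkage of the family
intercepts rather than a misspecified model.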


On Wed, Oct 7, 2015 at 10:54 AM, Thierry Onkelinx
<thierry.onkelinx at inbo.be> wrote:
> My example is not a requirement of an LMM but rather an example of a
> distribution of a variable which can cause trouble with an LMM. Think of an
> area. An area cannot be negative. This can cause artefacts in the
> residuals when you have lots of values near zero. Have a look at this
> example.
>
> n <- 200
> dataset <- data.frame(
>   X = runif(n)
> )
> dataset$eta <- -.1 + 3 * dataset$X
> dataset$Y <- rpois(n, lambda = exp(dataset$eta))
> # wrong analysis for this kind of data, here just an illustration of the problem
> model <- lm(Y ~ X, data = dataset)
> plot(fitted(model), resid(model))
>
> But this doesn't seem to be the problem in your case.
>
> I would recommend that you see if there are patterns in the residuals when
> you plot them against the covariates. Maybe you are missing an interaction
> or even an important covariate.
>
> Best regards,
>
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
> Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
>
> To call in the statistician after the experiment is done may be no more than
> asking him to perform a post-mortem examination: he may be able to say what
> the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
>
> 2015-10-07 17:29 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:
>>
>> Y is a brain measure that has been standardized. A histogram of Y is here:
>> http://imgur.com/Um8yyuu
>>
>> I am confused about the "Y must be non-negative and the dataset
>> contains observations close to 0" part. Is that a requirement for
>> Y? If so, then my model could be wrong.
>>
>> On Wed, Oct 7, 2015 at 10:15 AM, Thierry Onkelinx
>> <thierry.onkelinx at inbo.be> wrote:
>> > Can you elaborate on what Y is? Does it have a lower boundary? And if so,
>> > do you have observations near that boundary? E.g. Y must be non-negative
>> > and the dataset contains observations close to 0. A density plot would be
>> > useful.
>> >
>> > 2015-10-07 17:09 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:
>> >>
>> >> Hi Thierry,
>> >>
>> >> Thank you for your reply and sorry for the HTML thing. Below is my
>> >> summary(model) output.
>> >>
>> >> Y, Drink, and Age are continuous variables
>> >> Gender is F & M.
>> >> Family_ID is a factor.
>> >>
>> >> Linear mixed model fit by maximum likelihood  ['lmerMod']
>> >> Formula: Y ~ Drink * Gender + Age + (1 | Family_ID)
>> >>    Data: data
>> >>
>> >>      AIC      BIC   logLik deviance df.resid
>> >>   1046.4   1074.0   -516.2   1032.4      372
>> >>
>> >> Scaled residuals:
>> >>      Min       1Q   Median       3Q      Max
>> >> -2.67228 -0.56085 -0.02968  0.66166  2.91452
>> >>
>> >> Random effects:
>> >>  Groups    Name        Variance Std.Dev.
>> >>  Family_ID (Intercept) 0.3550   0.5958
>> >>  Residual              0.6162   0.7850
>> >> Number of obs: 379, groups:  Family_ID, 189
>> >>
>> >> Fixed effects:
>> >>                Estimate Std. Error t value
>> >> (Intercept)     1.10309    0.43921   2.511
>> >> Drink           0.16425    0.08031   2.045
>> >> Gender.M       -0.19364    0.10874  -1.781
>> >> Age            -0.03377    0.01489  -2.268
>> >> Drink:Gender.M -0.13647    0.10681  -1.278
>> >>
>> >> Correlation of Fixed Effects:
>> >>          (Intr) Drnk   Gndr.M Age
>> >> Drink    -0.098
>> >> Gender.M -0.040 -0.249
>> >> Age      -0.985  0.158 -0.054
>> >> Drnk:G.M  0.042 -0.737 -0.021 -0.085
>> >>
>> >> Thank you very much,
>> >> Cherry
>> >>
>> >> On Wed, Oct 7, 2015 at 5:14 AM, Thierry Onkelinx
>> >> <thierry.onkelinx at inbo.be> wrote:
>> >> > Dear Cherry,
>> >> >
>> >> > Please don't post in HTML. Have a look at the posting guide.
>> >> >
>> >> > You'll need to provide more information. What is the class of each
>> >> > variable (continuous, count, presence/absence, factor, ...)? What is the
>> >> > output of summary(model)?
>> >> >
>> >> > Best regards,
>> >> >
>> >> > 2015-10-06 17:15 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:
>> >> >>
>> >> >> Dear LMM experts:
>> >> >>
>> >> >> I am pretty new to using LMM and I have found the following situation
>> >> >> bewildering as I was trying to do diagnostics with my fitted model: my
>> >> >> conditional residuals correlate highly with the fitted values.
>> >> >>
>> >> >> I have a dataset with multiple families, each with 1-4 siblings. I am
>> >> >> trying to regress Y onto EVs including Drink, Gender, & Age, while using
>> >> >> a random intercept for family. This is the model I used:
>> >> >> model <- lmer(Y ~ Drink * Gender + Age + (1 | Family_ID), data = data, REML = FALSE)
>> >> >>
>> >> >> After fitting the model, I used
>> >> >> plot(model)
>> >> >> to see the relationship between conditional residuals and fitted values.
>> >> >> I expect them to be uncorrelated and I expect to see homoscedasticity.
>> >> >>
>> >> >> Yet to my surprise there is a high correlation (~0.5) between the
>> >> >> residuals and the fitted values (see here <http://imgur.com/pPsG4aR>). I
>> >> >> know from GLM that this usually suggests nonlinear relationships between
>> >> >> the EVs and the DV.
>> >> >>
>> >> >> I read some online posts (post1
>> >> >> <http://stats.stackexchange.com/questions/43566/strange-pattern-in-residual-plot-from-mixed-effect-model>
>> >> >> post2
>> >> >> <http://stats.stackexchange.com/questions/168179/correlation-between-standardized-residuals-and-fitted-values-in-a-linear-mixed-e/168210#168210>)
>> >> >> that suggest this can result from a poor model fit. So I tried a few
>> >> >> different models, including: 1) log transforming Drink, which is
>> >> >> originally positively skewed; 2) adding random slopes for Drink, Age, etc.
>> >> >> None of these changes has led to a substantial difference in the
>> >> >> correlation between residuals and fitted values.
>> >> >>
>> >> >> Some other info:
>> >> >> 1) my overall model fit is not poor, as indicated by the correlation
>> >> >> between fitted values & Y, which is around 0.8;
>> >> >> 2) most variables in my model have a normal, or at least symmetrical,
>> >> >> distribution;
>> >> >> 3) conditional residuals are normally distributed as shown in qqplots;
>> >> >> 4) conditional residuals are not correlated with any fixed effects, such
>> >> >> as Drink or Age.
>> >> >>
>> >> >> I have two guesses as to what is going on:
>> >> >> 1) maybe the fact that each family is a different size actually violates
>> >> >> assumptions of the model?
>> >> >> 2) or maybe there is something wrong with estimation of the random effect
>> >> >> (family intercept)?
>> >> >>
>> >> >> I'd really appreciate your insights as to what is going on here and if
>> >> >> there are any problems with my model.
>> >> >>
>> >> >> Thank you very much,
>> >> >> Cherry