[R-sig-ME] Problems with model (assumptions)

Tue Nov 24 19:42:39 CET 2015

On Mon, Nov 23, 2015 at 10:21 AM, Philipp Singer <killver at gmail.com> wrote:
> Thanks for both of your answers.
>
> I will proceed with the log-transformed version then, as this definitely
> provides the best fit.
>
> One more question though regarding the point made by Thierry regarding the
> residuals:
> I thought that when looking at scaled residuals (Pearson or deviance) they
> should be distributed normally (at least asymptotically). Is this wrong?

   Yes, asymptotically, but the relevant limit is "large samples per
point" rather than "large numbers of points" -- that is, Poisson or
negative binomial or binomial samples must have a large mean count [or,
in the case of binomial, an intermediate proportion, i.e. not too close
to 0% or 100%], in which case the conditional distribution converges to
normality, but with different variances depending on the mean, which
gets corrected by the Pearson/deviance residual computation.  In the
low-count case (which is where most GLM fitting gets done) the
assumption may not hold.  It's particularly terrible for Bernoulli
responses.
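
   A quick simulation sketch of the point above. This uses simulated
Poisson draws, not Philipp's data; the means 0.5 and 230 and the sample
size are arbitrary illustrative choices, and the residuals are computed
from the true mean rather than from a fitted model.

```r
# Simulated illustration only: means and sample size are assumptions.
set.seed(101)
n <- 10000

# Pearson residual for a Poisson response with known mean mu:
# (y - mu) / sqrt(mu), since Var(y) = mu for the Poisson.
pearson_pois <- function(mu, n) {
  y <- rpois(n, lambda = mu)
  (y - mu) / sqrt(mu)
}

r_low  <- pearson_pois(mu = 0.5, n = n)   # low-count regime
r_high <- pearson_pois(mu = 230, n = n)   # large-mean regime

# Sample skewness: close to 0 for a normal distribution.
# The theoretical skewness of a Poisson is 1 / sqrt(mu).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(low = skewness(r_low), high = skewness(r_high))

# Low-count residuals sit on only a handful of discrete values.
c(low = length(unique(r_low)), high = length(unique(r_high)))

# Normal Q-Q plots side by side.
op <- par(mfrow = c(1, 2))
qqnorm(r_low,  main = "Poisson, mean 0.5"); qqline(r_low)
qqnorm(r_high, main = "Poisson, mean 230"); qqline(r_high)
par(op)
```

   Both sets of residuals have variance close to 1, but only the
large-mean set should look roughly normal; the low-mean set stays
skewed and discrete no matter how many points are drawn.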

   In looking at your model again, I wonder: why are you treating
length as a random effect?  I would think that lengths would be
generally better modeled as continuous covariates, e.g. via an
additive model/spline term in the linear model.
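
   A minimal sketch of what that could look like, assuming a natural
spline from the splines package is an acceptable stand-in for "spline
term". The data below are simulated placeholders (the data frame `dat`,
the effect sizes and the choice df = 3 are all made up for
illustration); only the variable names follow the model in the
notebook.

```r
library(lme4)
library(splines)

set.seed(1)

# Hypothetical simulated data standing in for the real data set.
n_auth    <- 200
len       <- sample(2:30, n_auth, replace = TRUE)  # texts per author ("length")
author_id <- rep(seq_len(n_auth), times = len)

dat <- data.frame(
  author = factor(author_id),
  length = rep(len, times = len),
  index  = sequence(len)                           # 1..length within each author
)

a_eff <- rnorm(n_auth, sd = 0.3)                   # author-level random intercepts
mu    <- 5.4 - 0.01 * dat$index + 0.2 * log(dat$length) + a_eff[author_id]

# Positive integer "text lengths"; pmax() guards against log(0).
dat$body_length <- pmax(1, round(exp(mu + rnorm(nrow(dat), sd = 0.5))))

# length enters as a smooth fixed effect (natural spline, 3 df)
# instead of as a random effect (1 | length).
m_spline <- lmer(log(body_length) ~ 1 + index + ns(length, df = 3) +
                   (1 | author),
                 data = dat, REML = FALSE)

fixef(m_spline)
```

   The spline degrees of freedom would need to be chosen for the real
data, e.g. by comparing a few values of df with AIC.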

>
> Thanks a lot guys!
> Philipp
>
>
> On 11/23/2015 02:59 PM, Ben Bolker wrote:
>>
>> I *can* see the plots; it looks to me like Philipp's
>> log-transformation is almost perfect (the Q-Q plot shows a tiny bit of
>> a skew, but I wouldn't worry about it).  The model gives you a
>> convergence warning with magnitude 0.002, but we know from experience
>> (see ?convergence) that especially for 10^6 observations this is
>> probably a false positive.  I would definitely recommend proceeding
>> with
>>
>> m_lmer_log = lmer(log(body_length)~1+index+(1|author)+(1|length),
>> data=data, REML=FALSE)
>>
>> (i.e. line 84).
>>
>>
>> For count data with a large mean (your intercept is approx. 230),
>> log-transforming seems perfectly reasonable.
>>
>> On Mon, Nov 23, 2015 at 3:17 AM, Thierry Onkelinx
>> <thierry.onkelinx at inbo.be> wrote:
>>>
>>> Dear Philipp,
>>>
>>> I'm missing the graphs for the data exploration step in the notebook.
>>> They would give you an idea of whether the relationships with the
>>> explanatory variables are (log-)linear.
>>>
>>> The residual plots from the Gaussian model are typical of what you
>>> get when modelling count data. So you need a Poisson or negative
>>> binomial distribution.
>>>
>>> Normal Q-Q plots for GLMs are irrelevant. Residuals versus fitted
>>> values are difficult to interpret. You should focus on residuals versus
>>> explanatory variables (fixed and random).
>>>
>>> You could consider using length as an offset factor. That seems to make
>>> more sense than as a random effect. Since length is the maximum body
>>> length
>>> per author, you would model the relative body length per author.
>>>
>>> There are other R packages that can fit GLMMs: glmmADMB, INLA, ... You
>>> can try them and see what happens.
>>>
>>> Best regards,
>>>
>>> ir. Thierry Onkelinx
>>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature
>>> and Forest
>>> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
>>> Kliniekstraat 25
>>> 1070 Anderlecht
>>> Belgium
>>>
>>> To call in the statistician after the experiment is done may be no more
>>> than asking him to perform a post-mortem examination: he may be able to
>>> say
>>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>>> The plural of anecdote is not data. ~ Roger Brinner
>>> The combination of some data and an aching desire for an answer does not
>>> ensure that a reasonable answer can be extracted from a given body of
>>> data.
>>> ~ John Tukey
>>>
>>> 2015-11-20 14:17 GMT+01:00 Philipp Singer <killver at gmail.com>:
>>>
>>>> Dear all,
>>>>
>>>> I am currently trying to investigate the effect of time (in the sense of
>>>> an index) on the length of a text that people write (body_length). So,
>>>> e.g., my hypothesis is that the later someone writes a text, the shorter
>>>> it is. Not all authors write the same number of texts, so I have an
>>>> additional variable that captures the maximum index (length). One
>>>> further thing to note is that authors can have several "sessions" on
>>>> different days.
>>>>
>>>> I have started to use a linear mixed-effects model. However, the basic
>>>> assumptions of linear regression do not seem to hold (e.g., normality of
>>>> residuals) which is to be expected for count data (text length).
>>>>
>>>> Thus, I have tried several other GLMs and adaptations. However, for most
>>>> of them, the assumptions do not hold either. Also, I receive several odd
>>>> errors for some models.
>>>>
>>>> The best results are achieved when I just log-transform the outcome and
>>>> use linear regression. However, according to the literature, this is not
>>>> the proper way of treating count data.
>>>>
>>>> One thing to note is that my data set is enormous (50 million data
>>>> points). I have worked with a sample of 1 million data points here;
>>>> results for the whole data set are similar, though.
>>>>
>>>> Instead of individually highlighting all the results in this mail, I
>>>> have prepared an IPython notebook (using R and lme4) that shows the
>>>> main steps I have carried out so far.
>>>>
>>>> It can be found here:
>>>> https://nbviewer.jupyter.org/gist/anonymous/2897dd277a35a0df52ea
>>>>
>>>> I am hoping for some advice on how to proceed.
>>>>
>>>> Thanks in advance!
>>>>
>>>> _______________________________________________
>>>> R-sig-mixed-models at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>>