[R-sig-ME] Problems with model (assumptions)

Mon Nov 23 16:21:18 CET 2015

Thanks for both of your answers.

I will proceed with the log-transformed version then, as this definitely
provides the best fit.

One more question, though, about Thierry's point on the residuals:
I thought that scaled residuals (Pearson or deviance) should be
normally distributed, at least asymptotically. Is this wrong?
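
A minimal sketch of what "asymptotically" would mean here, assuming
simulated Poisson counts rather than the notebook data: the Pearson
residuals of a Poisson model have skewness of roughly 1/sqrt(mean), so
they only look normal when the mean is large. The helper names and the
two means (2 and 230, the latter chosen to match the intercept quoted
below) are illustrative assumptions, and the sketch uses a plain glm
without random effects.

```r
# Simulated data only: hypothetical means, not the body_length data.
set.seed(1)

# Sample skewness (third standardized moment), written out by hand
# so that no extra package is needed.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Fit an intercept-only Poisson GLM to simulated counts with mean `mu`
# and return the skewness of its Pearson residuals.
pearson_skew <- function(mu, n = 1e5) {
  y <- rpois(n, lambda = mu)
  fit <- glm(y ~ 1, family = poisson)
  skewness(residuals(fit, type = "pearson"))
}

skew_small <- pearson_skew(2)    # theory: 1 / sqrt(2), about 0.71
skew_large <- pearson_skew(230)  # theory: 1 / sqrt(230), about 0.07
```

With a mean around 230 the residuals should be close to symmetric, while
with a mean of 2 they are clearly right-skewed, so a normal Q-Q plot
would only be expected to look straight in the first case.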

Thanks a lot guys!
Philipp

On 11/23/2015 02:59 PM, Ben Bolker wrote:
> I *can* see the plots; it looks to me like Philipp's
> log-transformation is almost perfect (the Q-Q plot shows a tiny bit of
> a skew, but I wouldn't worry about it).  The model gives you a
> convergence warning with magnitude 0.002, but we know from experience
> (see ?convergence) that especially for 10^6 observations this is
> probably a false positive.  I would definitely recommend proceeding
> with
>
> m_lmer_log = lmer(log(body_length)~1+index+(1|author)+(1|length),
> data=data, REML=FALSE)
>
> (i.e. line 84).
>
>
> For count data with a large mean (your intercept is approx. 230),
> log-transforming seems perfectly reasonable.
>
> On Mon, Nov 23, 2015 at 3:17 AM, Thierry Onkelinx
> <thierry.onkelinx at inbo.be> wrote:
>> Dear Philipp,
>>
>> I'm missing the graphs for the data exploration step in the notebook. They
>> would give you an idea of whether the relations with the explanatory
>> variables are (log-)linear.
>>
>> The residual plots from the Gaussian model are typical when modelling count
>> data. So you need a Poisson or negative binomial distribution.
>>
>> Normal Q-Q plots are irrelevant for GLMs. Residuals versus fitted values are
>> difficult to interpret. You should focus on residuals versus explanatory
>> variables (fixed and random).
>>
>> You could consider using length as an offset factor. That seems to make
>> more sense than as a random effect. Since length is the maximum body length
>> per author, you would model the relative body length per author.
>>
>> There are other R packages that can fit GLMMs, e.g. glmmADMB and INLA. You
>> can try them and see what happens.
>>
>> Best regards,
>>
>> ir. Thierry Onkelinx
>> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
>> Forest
>> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
>> Kliniekstraat 25
>> 1070 Anderlecht
>> Belgium
>>
>> To call in the statistician after the experiment is done may be no more
>> than asking him to perform a post-mortem examination: he may be able to say
>> what the experiment died of. ~ Sir Ronald Aylmer Fisher
>> The plural of anecdote is not data. ~ Roger Brinner
>> The combination of some data and an aching desire for an answer does not
>> ensure that a reasonable answer can be extracted from a given body of data.
>> ~ John Tukey
>>
>> 2015-11-20 14:17 GMT+01:00 Philipp Singer <killver at gmail.com>:
>>
>>> Dear all,
>>>
>>> I am currently trying to investigate the effect of time (in the sense of
>>> an index) on the length of a text that people write (body_length). My
>>> hypothesis is, e.g., that the later someone writes a text, the shorter it
>>> is. Not all authors write the same number of texts, so I have an
>>> additional variable that captures the maximum index (length). One
>>> further thing to note is that authors can have several "sessions" on
>>> different days.
>>>
>>> I have started to use a linear mixed-effects model. However, the basic
>>> assumptions of linear regression do not seem to hold (e.g., normality of
>>> residuals) which is to be expected for count data (text length).
>>>
>>> Thus, I have tried several other GLMs and adaptations. However, for most of
>>> them, the assumptions do not hold either. Also, I receive several odd
>>> errors for some models.
>>>
>>> The best results are achieved when I just log-transform the outcome and
>>> use linear regression. However, as suggested in the literature, this is not
>>> the proper way of treating count data.
>>>
>>> One thing to note is that my data set is enormous (50 million data points).
>>> I have worked with a sample of 1 million data points here; results for the
>>> whole data set are similar, though.
>>>
>>> Instead of individually highlighting all the results in this mail, I
>>> have prepared an IPython notebook (using R and lme4) that should
>>> convey the main steps I have taken so far.
>>>
>>> It can be found here:
>>> https://nbviewer.jupyter.org/gist/anonymous/2897dd277a35a0df52ea
>>>
>>> I am hoping for some advice on how to proceed.
>>>
>>> Thanks in advance!
>>>
>>> _______________________________________________
>>> R-sig-mixed-models at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>