[R-sig-ME] Problems with model (assumptions)

Ben Bolker bbolker at gmail.com
Mon Nov 23 14:59:45 CET 2015


I *can* see the plots; it looks to me like Philipp's
log-transformation is almost perfect (the Q-Q plot shows a tiny bit of
a skew, but I wouldn't worry about it).  The model gives you a
convergence warning with magnitude 0.002, but we know from experience
(see ?convergence) that especially for 10^6 observations this is
probably a false positive.  I would definitely recommend proceeding
with

m_lmer_log = lmer(log(body_length)~1+index+(1|author)+(1|length),
data=data, REML=FALSE)

(i.e. line 84).
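
A minimal sketch of how one might check whether such a warning is a false positive, using simulated data rather than Philipp's. The data frame `sim`, the group sizes, and all parameter values below are made up for illustration. The scaled-gradient calculation follows the general pattern described in lme4's convergence troubleshooting notes.

```r
library(lme4)

set.seed(101)
## Hypothetical stand-in for the real data: 200 authors, each writing
## between 5 and 50 texts; 'length' is the maximum index per author.
n_author <- 200
n_texts  <- sample(5:50, n_author, replace = TRUE)
sim <- data.frame(
  author = factor(rep(seq_len(n_author), times = n_texts)),
  index  = sequence(n_texts),
  length = rep(n_texts, times = n_texts)
)

## Assumed "true" values: intercept log(230), slope -0.01 per index step,
## author SD 0.3, length-group SD 0.1, residual SD 0.2 (all on the log scale).
author_eff <- rnorm(n_author, sd = 0.3)
length_eff <- rnorm(50, sd = 0.1)   # indexed by the value of 'length'
eta <- log(230) - 0.01 * sim$index +
  author_eff[as.integer(sim$author)] + length_eff[sim$length]
sim$body_length <- pmax(1, round(exp(eta + rnorm(nrow(sim), sd = 0.2))))

## Same model structure as above, fitted to the simulated data
fit <- lmer(log(body_length) ~ 1 + index + (1 | author) + (1 | length),
            data = sim, REML = FALSE)

## Gradient scaled by the Hessian at the optimum; a very small value
## suggests the optimizer did reach a sensible optimum.
derivs  <- fit@optinfo$derivs
relgrad <- with(derivs, solve(Hessian, gradient))
max(abs(relgrad))
```

If the scaled gradient is tiny and the estimates barely move when the model is refitted with other optimizers (for example via `allFit(fit)`), the warning can reasonably be treated as a false positive.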


For count data with a large mean (your intercept is approx. 230),
log-transforming seems perfectly reasonable.
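
The reasoning here can be illustrated with a short simulation. This is only a sketch under an idealized assumption (pure Poisson counts with mean 230, no overdispersion, which real text lengths are unlikely to satisfy). By the delta method, Var(log Y) is approximately Var(Y)/mean^2 = 1/lambda, so for a large mean the log-transformed counts have a small, nearly constant variance and are close to Gaussian.

```r
set.seed(42)
lambda <- 230                 # assumed mean, taken from the intercept above
y <- rpois(1e5, lambda)       # idealized counts; zeros are practically impossible here

## Standard deviation of log counts vs. the delta-method approximation
sd_log   <- sd(log(y))
delta_sd <- 1 / sqrt(lambda)
c(simulated = sd_log, delta_method = delta_sd)

## Sample skewness of the log counts (0 for a perfectly symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(log(y))

## Visual check, comparable to the Q-Q plot discussed above
qqnorm(log(y)); qqline(log(y))
```

With a small mean (say, below 10) the same exercise looks much worse, and zeros make `log(y)` undefined, which is when a Poisson or negative binomial GLMM becomes the more natural choice.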

On Mon, Nov 23, 2015 at 3:17 AM, Thierry Onkelinx
<thierry.onkelinx at inbo.be> wrote:
> Dear Philipp,
>
> I'm missing the graphs for the data exploration step in the notebook. They
> would give you an idea of whether the relationships with the explanatory
> variables are (log-)linear.
>
> The residual plots from the Gaussian model are typical when modelling count
> data. So you need a Poisson or negative binomial distribution.
>
> Normal Q-Q plots for GLMs are irrelevant. Residuals versus fitted values are
> difficult to interpret. You should focus on residuals versus explanatory
> variables (fixed and random).
>
> You could consider using length as an offset factor. That seems to make
> more sense than as a random effect. Since length is the maximum body length
> per author, you would model the relative body length per author.
>
> There are other R packages that can fit GLMMs (glmmADMB, INLA, ...). You can
> try them and see what happens.
>
> Best regards,
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
> Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
>
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
>
> 2015-11-20 14:17 GMT+01:00 Philipp Singer <killver at gmail.com>:
>
>> Dear all,
>>
>> I am currently trying to investigate the effect of time (in the sense of
>> an index) on the length of a text that people write (body_length). So,
>> e.g., my hypothesis is that the later someone writes a text, the shorter it
>> is. Not all authors write the same number of individual texts, so I
>> have an additional variable that captures the maximum index (length). One
>> further thing to note is that authors can have several "sessions" on
>> different days.
>>
>> I have started to use a linear mixed-effects model. However, the basic
>> assumptions of linear regression do not seem to hold (e.g., normality of
>> residuals) which is to be expected for count data (text length).
>>
>> Thus, I have tried several other GLMs and adaptations. However, for most of
>> them, the assumptions do not hold either. Also, I receive several odd
>> errors for some models.
>>
>> The best results are achieved when I just log-transform the outcome and
>> use linear regression. However, as suggested in the literature, this is not
>> the proper way of treating count data.
>>
>> One thing to note is that my data set is enormous (50 million data points).
>> I have worked with a sample of 1 million data points here; results for the
>> whole data set are similar, though.
>>
>> Instead of individually highlighting all the results in this mail, I
>> have prepared an IPython notebook (using R and lme4) that should
>> convey the main steps I have taken so far.
>>
>> It can be found here:
>> https://nbviewer.jupyter.org/gist/anonymous/2897dd277a35a0df52ea
>>
>> I am hoping for some advice on how to proceed.
>>
>> Thanks in advance!
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>


