[R-sig-ME] Problems with model (assumptions)

Fri Nov 20 14:17:20 CET 2015

Dear all,

I am currently trying to investigate the effect of time (in the sense of 
an index) on the length of the texts that people write (body_length). My 
hypothesis is that the later someone writes a text, the shorter it is. 
Not all authors write the same number of texts, so I have an additional 
variable (length) that captures the maximum index. One further thing to 
note is that authors can have several "sessions" on different days.

I started with a linear mixed-effects model. However, the basic 
assumptions of linear regression do not seem to hold (e.g., normality of 
residuals), which is to be expected for count data (text length).
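
To illustrate why count data tend to break these assumptions, here is a 
minimal Python sketch (standard library only). It is not taken from the 
notebook: the Poisson sampler, the seed, and the two means (2 and 20) are 
assumptions chosen purely for illustration. For Poisson counts the 
variance equals the mean, so the spread of the residuals grows with the 
fitted value instead of staying constant.

```python
import math
import random
import statistics


def sample_poisson(mean, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    # Keep multiplying uniforms until the product drops below exp(-mean).
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def mean_and_variance(mean, n, seed=0):
    """Simulate n Poisson counts and return (sample mean, sample variance)."""
    rng = random.Random(seed)
    draws = [sample_poisson(mean, rng) for _ in range(n)]
    return statistics.fmean(draws), statistics.variance(draws)


# Hypothetical means; real text lengths are likely far larger and overdispersed.
for mu in (2, 20):
    m, v = mean_and_variance(mu, 20000)
    print(f"true mean={mu}  sample mean={m:.2f}  sample variance={v:.2f}")
```

The sample variance tracks the sample mean in both cases, which is the 
kind of mean-variance relationship a Gaussian model does not allow for.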

I have therefore tried several GLMs and adaptations of them. However, 
for most of them the assumptions do not hold either. I also receive 
several odd errors for some models.

The best results are achieved when I simply log-transform the outcome 
and use linear regression. However, as suggested in the literature, this 
is not the proper way of treating count data.
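
One reason usually given for this: a linear model on log(y) describes the 
mean of log(y), whereas a count model with a log link describes the log 
of the mean of y. Back-transforming the former yields a geometric mean, 
which is smaller than the arithmetic mean for skewed data. The small 
Python sketch below shows the gap; the three text lengths are made-up 
values for illustration, not data from the notebook.

```python
import math
import statistics


def arithmetic_mean(values):
    """Plain average of the raw values."""
    return statistics.fmean(values)


def back_transformed_log_mean(values):
    """exp(mean(log(y))): what back-transforming a log-scale mean gives."""
    logs = [math.log(v) for v in values]
    return math.exp(statistics.fmean(logs))


# Hypothetical text lengths, chosen only to make the arithmetic obvious.
lengths = [10, 100, 1000]

print(arithmetic_mean(lengths))                      # 370.0
print(round(back_transformed_log_mean(lengths), 6))  # 100.0
```

The two summaries differ by a factor of 3.7 here, so coefficients from the 
log-transformed linear model and from a log-link count model answer 
slightly different questions even when both fit well.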

Note that my data set is enormous (50 million data points). I have 
worked with a sample of 1 million data points here; the results for the 
whole data set are similar, though.

Instead of highlighting all the results individually in this mail, I 
have prepared an IPython notebook (using R and lme4) that shows the main 
steps I have carried out so far.

It can be found here:
https://nbviewer.jupyter.org/gist/anonymous/2897dd277a35a0df52ea

I am hoping for some advice on how to proceed.

Thanks in advance!