[R-sig-ME] Problems with model (assumptions)
Philipp Singer
killver at gmail.com
Fri Nov 20 14:17:20 CET 2015
Dear all,
I am currently trying to investigate the effect of time (in the sense of
an index) on the length of a text that people write (body_length). So,
e.g., my hypothesis is that the later someone writes a text, the shorter
it is. Not all authors write the same number of texts, so I have an
additional variable (length) that captures each author's maximum index.
One further thing to note is that authors can have several
"sessions" on different days.
I have started to use a linear mixed-effects model. However, the basic
assumptions of linear regression do not seem to hold (e.g., normality of
residuals) which is to be expected for count data (text length).
Thus, I have tried several other GLMs and adaptations. However, for most
of them, the assumptions do not hold either. I also receive several
odd errors for some models.
The best results are achieved when I simply log-transform the outcome
and use linear regression. However, as suggested in the literature, this
is not the proper way of treating count data.
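For readers without the notebook at hand, the comparison described above can be sketched roughly as follows. This is only an illustration on simulated data, not the notebook's code: the grouping column `author`, the simulated effect sizes, the random-intercept-only structure, and the `+ 1` offset inside the log are all assumptions made for the example. Only `body_length`, `index`, and `length` are names taken from the description above.

```r
library(lme4)

set.seed(1)

# Simulated stand-in for the real data (assumption: one row per text,
# grouped by a hypothetical `author` factor).
n_authors <- 200
len <- sample(2:20, n_authors, replace = TRUE)   # texts per author ("length")
d <- data.frame(
  author = factor(rep(seq_len(n_authors), times = len)),
  index  = unlist(lapply(len, seq_len)),         # position of the text, 1..length
  length = rep(len, times = len)
)

# Overdispersed counts with an author-level random intercept and a
# small negative effect of index (values chosen arbitrarily).
u  <- rnorm(n_authors, sd = 0.3)
mu <- exp(4 - 0.02 * d$index + u[as.integer(d$author)])
d$body_length <- rnbinom(nrow(d), mu = mu, size = 2)

# (a) Linear mixed model on the log-transformed outcome.
#     The + 1 guards against zero counts in the simulated data.
m_log <- lmer(log(body_length + 1) ~ index + length + (1 | author), data = d)

# (b) Poisson GLMM on the raw counts.
m_pois <- glmer(body_length ~ index + length + (1 | author),
                family = poisson, data = d)

# Rough overdispersion check for the Poisson fit:
# Pearson chi-square divided by the residual degrees of freedom.
# Values well above 1 suggest the Poisson variance assumption fails.
overdisp <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
print(overdisp)

# Residual check for the log-scale model.
qqnorm(residuals(m_log)); qqline(residuals(m_log))
```

If the dispersion ratio comes out far above 1, a negative binomial GLMM (for instance via lme4's `glmer.nb`) or an observation-level random effect are common alternatives to consider. Sessions could be added as a further grouping factor nested within author, which this sketch leaves out.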
One thing to note is that my data set is enormous (50 million data
points). I have worked with a sample of 1 million data points here;
results for the full data set are similar, though.
Rather than listing all the results individually in this mail, I
have prepared an IPython notebook (using R and lme4) that
shows the main steps I have taken so far.
It can be found here:
https://nbviewer.jupyter.org/gist/anonymous/2897dd277a35a0df52ea
I am hoping for some advice on how to proceed.
Thanks in advance!