[R-sig-ME] Problems with model (assumptions)

Mon Nov 23 09:17:28 CET 2015

Dear Philipp,

I'm missing graphs from a data exploration step in the notebook. They would
give you an idea of whether the relationships with the explanatory variables
are (log-)linear.
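
A minimal sketch of such an exploration plot, using simulated data because I
don't have yours. The variable names follow your description (body_length,
index); the simulated values and the "+ 1" inside the log are my assumptions.

```r
# Simulated stand-in for the real data; values are made up for illustration.
set.seed(4)
index <- sample(1:100, 1000, replace = TRUE)
body_length <- rpois(1000, exp(5 - 0.01 * index))

# Response on the log scale against the covariate, with a lowess smoother.
# A roughly straight smoother suggests a log-linear relationship.
log_y <- log(body_length + 1)  # + 1 guards against log(0)
sm <- lowess(index, log_y)
plot(index, log_y, xlab = "index", ylab = "log(body_length + 1)")
lines(sm, col = "red", lwd = 2)
```

With 50 million rows you would draw this on a random subsample.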

The residual plots from the Gaussian model are typical of what you get when
modelling count data that way. So you need a Poisson or negative binomial
distribution.
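
As a rough sketch of that comparison on simulated data (all values are made
up). It leaves out the random effects so that it only needs base R and MASS;
in lme4 the mixed counterparts would be glmer(..., family = poisson) and
glmer.nb().

```r
library(MASS)  # for glm.nb()

# Simulated overdispersed counts whose log-mean decreases with index.
set.seed(1)
n <- 2000
index <- sample(1:50, n, replace = TRUE)
body_length <- rnbinom(n, size = 2, mu = exp(5 - 0.02 * index))
d <- data.frame(body_length, index)

# Poisson GLM with the default log link.
m_pois <- glm(body_length ~ index, family = poisson, data = d)

# Rough overdispersion check: Pearson chi-square over residual df.
# Values well above 1 indicate more variance than the Poisson allows.
disp <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
disp

# Negative binomial GLM, which estimates an extra dispersion parameter.
m_nb <- glm.nb(body_length ~ index, data = d)
AIC(m_pois, m_nb)
```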

Normal QQ plots are irrelevant for GLMs. Plots of residuals versus fitted
values are difficult to interpret. You should focus on residuals versus the
explanatory variables (fixed and random).
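
A sketch of what I mean, again on simulated data. The "author" grouping
variable and all values are assumptions. The model deliberately omits the
author effect, so the second plot shows what a missing random effect looks
like.

```r
# Simulated counts with author-level variation in the intercept.
set.seed(2)
n_author <- 40
author <- factor(rep(seq_len(n_author), each = 25))
index <- rep(1:25, times = n_author)
a <- rnorm(n_author, sd = 0.3)  # author-specific intercepts
body_length <- rpois(length(index), exp(4 - 0.03 * index + a[author]))
d <- data.frame(body_length, index, author)

# Fixed-effects-only Poisson model and its Pearson residuals.
m <- glm(body_length ~ index, family = poisson, data = d)
d$res <- residuals(m, type = "pearson")

op <- par(mfrow = c(1, 2))
# Residuals versus a fixed covariate: the smoother should stay near zero.
plot(d$index, d$res, xlab = "index", ylab = "Pearson residual")
lines(lowess(d$index, d$res), col = "red")
abline(h = 0, lty = 2)
# Residuals per level of the grouping variable: boxes shifted away from
# zero point to structure the model does not capture.
boxplot(res ~ author, data = d, xlab = "author", ylab = "Pearson residual")
abline(h = 0, lty = 2)
par(op)
```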

You could consider using length as an offset. That seems to make more sense
than using it as a random effect. Since length is the maximum body length
per author, you would model the relative body length per author.
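
With a log link the offset enters as log(length), so the linear predictor is
log(E[body_length]) = log(length) + b0 + b1 * index, i.e. you model
body_length relative to length. A minimal sketch on simulated data (values
made up, no random effects, length treated simply as a positive per-author
quantity):

```r
# Simulated counts proportional to a per-author length, with a relative
# rate that declines with index.
set.seed(3)
n_author <- 50
index <- rep(1:20, times = n_author)
length_author <- rep(sample(50:500, n_author, replace = TRUE), each = 20)
body_length <- rpois(length(index),
                     length_author * exp(-0.5 - 0.04 * index))
d <- data.frame(body_length, index, length = length_author)

# offset() fixes the coefficient of log(length) at 1 instead of estimating it.
m_off <- glm(body_length ~ index + offset(log(length)),
             family = poisson, data = d)
coef(m_off)
```

The same offset(log(length)) term can go into a glmer() formula.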

There are other R packages that can fit GLMMs, e.g. glmmADMB, INLA, ... You
can try them and see what happens.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2015-11-20 14:17 GMT+01:00 Philipp Singer <killver op gmail.com>:

> Dear all,
>
> I am currently trying to investigate the effect of time (in the sense of
> an index) on the length of a text that people write (body_length). So,
> e.g., my hypothesis is that the later someone writes a text, the shorter it
> is. Not all authors write the same number of individual texts, thus I
> have an additional variable that captures the maximum index (length). One
> further thing to note is that authors can have several "sessions" on
> different days.
>
> I have started to use a linear mixed-effects model. However, the basic
> assumptions of linear regression do not seem to hold (e.g., normality of
> residuals) which is to be expected for count data (text length).
>
> Thus, I have tried several other GLMs and adaptations. However, for most of
> them, the assumptions do not hold either. Also, I receive several odd
> errors for some models.
>
> The best results can be achieved when I just log transform the outcome and
> use linear regression. However, as suggested in literature, this is not the
> proper way of treating count data.
>
> One thing to note is that my data set is enormous (50 million data points).
> I have worked with a sample of 1 million data points here; the results for
> the whole data set are similar, though.
>
> Instead of highlighting all the results individually in this mail, I
> have decided to prepare an IPython notebook (using R and lme4) that should
> convey the main procedure I have followed so far.
>
> It can be found here:
> https://nbviewer.jupyter.org/gist/anonymous/2897dd277a35a0df52ea
>
> I am hoping for some advice on how to proceed.
>
> Thanks in advance!
>
> _______________________________________________
> R-sig-mixed-models op r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
