[R-sig-ME] Advice regarding model choice
Philipp Singer
killver at gmail.com
Tue Oct 27 02:52:52 CET 2015
Dear David,
Thanks a lot for your response.
Actually, I have already tried the first solution you suggested, adding
an observation-level random effect. Unfortunately, it ends with an error
that I have not yet found a solution for:
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in
pwrssUpdate
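For reference, here is a minimal sketch of the kind of observation-level
random effect (OLRE) model I mean, with simulated stand-in data (the real
set is far too large to post); the control settings are common workarounds
for PIRLS step-halving failures, though they are no guarantee:

```r
library(lme4)

## Hypothetical stand-in data: names follow the thread, values simulated
set.seed(1)
d <- data.frame(subject = factor(rep(1:50, each = 10)),
                index   = rep(1:10, times = 50))
d$text_length <- rpois(nrow(d), lambda = exp(4 - 0.05 * d$index))

## One random-effect level per observation absorbs overdispersion
d$obs <- factor(seq_len(nrow(d)))

## nAGQ = 0 (cheaper, approximate) plus an alternative optimizer
## sometimes gets past PIRLS step-halving failures
m_olre <- glmer(text_length ~ index + (1 | subject) + (1 | obs),
                family = poisson, data = d, nAGQ = 0,
                control = glmerControl(optimizer = "bobyqa"))
```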
My full data set is also really large, roughly 35 million
observations... I could work with subsamples, though.
In general, though, I am not sure the effort is even worth it, given the
qqplots I attached to my previous mail: neither a Poisson nor a negative
binomial model seems to fit the data well.
Cheers,
Philipp
P.S.: David accidentally replied only to me, which is why I am
continuing the discussion here.
On 10/26/2015 04:55 PM, David Jones wrote:
> Dear Philipp - I recently had a similar situation, and here is my 2c.
>
> Regarding the model, two common ways to account for overdispersion
> are negative binomial and Poisson-lognormal models (for a
> Poisson-lognormal model that is easy to implement in lme4, see the
> code that accompanies a Ben Bolker manuscript at
> https://blogs.umass.edu/nrc697sa-finnj/2012/11/08/bolkers-reanalysis-of-owl-data/).
> They will probably run very slowly on a sample of that size: a
> negative binomial model on a dataset of N = 600k recently took me
> about two hours, even on Amazon EC2. You may want to use the verbose
> argument and the system.time() function to confirm that the fit is
> making progress and to record how long it took once it completes.
> These models took much longer to run than a regular Poisson on my
> machines.
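[A sketch of the timing/progress advice above, with small simulated
data (hypothetical names; the real fits would take far longer): wrap the
fit in system.time() and pass verbose = TRUE so the optimizer prints
progress as it iterates.]

```r
library(lme4)

## Hypothetical simulated negative-binomial counts
set.seed(2)
d <- data.frame(subject = factor(rep(1:40, each = 10)),
                index   = rep(1:10, times = 40))
d$text_length <- rnbinom(nrow(d), mu = exp(4 - 0.05 * d$index), size = 2)

## system.time() records the cost; verbose = TRUE prints optimizer
## progress, so you can tell a long-running fit is still moving
t_nb <- system.time(
  m_nb <- glmer.nb(text_length ~ index + (1 | subject),
                   data = d, verbose = TRUE)
)
t_nb["elapsed"]  # seconds spent fitting
```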
>
> For the convergence issues, I found the following site very helpful:
> https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html
> In particular, recoding predictors helped (mine were categorical, and
> changing the coding scheme helped some), and using the estimates from
> a first model run as starting values for a second run was enough to
> eliminate the warnings (and at times, using the second run's estimates
> to start a third). Admittedly, these issues may disappear or change
> when you change modeling approaches.
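[The restart trick above can be sketched as follows; the data and model
are hypothetical, but getME()/update() with a start list is the usual
lme4 recipe for refitting from a previous fit's estimates.]

```r
library(lme4)

## Hypothetical simulated data for illustration
set.seed(3)
d <- data.frame(subject = factor(rep(1:30, each = 10)),
                x = rnorm(300))
d$y <- rpois(300, lambda = exp(1 + 0.3 * d$x))

m1 <- glmer(y ~ x + (1 | subject), family = poisson, data = d)

## Extract the first fit's parameters and restart from them; this
## often clears spurious convergence warnings
ss <- getME(m1, c("theta", "fixef"))
m2 <- update(m1, start = ss,
             control = glmerControl(optCtrl = list(maxfun = 2e4)))
```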
>
> On Mon, Oct 26, 2015 at 7:15 PM, Philipp Singer <killver at gmail.com
> <mailto:killver at gmail.com>> wrote:
>
> My current data to study looks like the following:
>
> Suppose that we repeatedly let subjects write a piece of text. We
> are now mainly interested in whether the consecutive writing has
> an effect on features of the written text.
>
> For example, we can hypothesize that the fifth text is shorter
> than the first.
>
> To that end, the data looks like the following (based on only the
> text length feature):
>
> subject | text_length (characters) | index | total_amount
>
> I have identified total_amount as an important feature to consider:
> text length differs between, e.g., people who write the text 100
> times and those who write it only 5 times; the setting is not
> balanced.
>
> Sample data for one subject could look like:
> subject | text_length | index | total_amount
> 1 | 100 | 1 | 3
> 1 | 78 | 2 | 3
> 1 | 80 | 3 | 3
>
> A reasonable model my experiments have suggested is the following:
>
> text_length ~ 1 + index + total_amount + (1|subject)
>
> Alternatively, it might also be reasonable to add (1|total_amount)
> instead of incorporating it as a fixed effect.
>
> In this model, as hypothesized, the index shows a negative
> coefficient.
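[The two specifications above can be sketched in lme4 as follows; the
data is a simulated, hypothetical stand-in with the structure described
in the post. Note that with only a few distinct total_amount values, the
(1|total_amount) variant estimates a variance component from very few
levels.]

```r
library(lme4)

## Hypothetical stand-in for the real data (structure from the post)
set.seed(4)
totals <- sample(c(5, 20, 100), 40, replace = TRUE)
d <- do.call(rbind, lapply(seq_along(totals), function(s)
  data.frame(subject      = factor(s),
             index        = seq_len(totals[s]),
             total_amount = totals[s])))
d$text_length <- round(100 - 3 * log(d$index) + rnorm(nrow(d), sd = 10))

## total_amount as a fixed effect ...
m_fixed <- lmer(text_length ~ 1 + index + total_amount + (1 | subject),
                data = d)

## ... or as a grouping factor (here only 3 levels, which is few for
## estimating a variance component)
m_rand <- lmer(text_length ~ 1 + index + (1 | total_amount) + (1 | subject),
               data = d)

fixef(m_fixed)["index"]  # estimated index effect
```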
>
> The main reason for this post, though, is that I am unsure whether I
> can justify using a linear model here: neither the data nor the
> residuals are normally distributed.
>
> In the following, I have plotted some qqplots with different fits
> (based on a large sample).
>
> http://imgur.com/a/jinav
>
> Usually, I would proceed with such "count" data by using a Poisson
> GLM, but it does not converge. Also, as the plots suggest, a Poisson
> distribution does not seem to be a good fit here; the Poisson fit
> also indicates strong overdispersion.
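[A quick numeric check for the overdispersion mentioned above, using
only base R and hypothetical simulated counts: the Pearson chi-square
statistic divided by the residual degrees of freedom should be near 1
for a well-fitting Poisson model, and well above 1 under
overdispersion.]

```r
## Hypothetical overdispersed counts (the real data is much larger)
set.seed(5)
n <- 2000
index <- sample(1:100, n, replace = TRUE)
## Negative-binomial draws have variance > mean, i.e. overdispersion
y <- rnbinom(n, mu = exp(3 - 0.01 * index), size = 1)

m_pois <- glm(y ~ index, family = poisson)

## Pearson chi-square over residual df: ~1 for a good Poisson fit,
## values well above 1 indicate overdispersion
disp <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
disp
```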
>
> An important thing to note here is that my real data is very, very
> large (multiple millions of data points).
>
> Do you guys have any suggestions on how to proceed?
>
> Thanks!
>
> _______________________________________________
> R-sig-mixed-models at r-project.org
> <mailto:R-sig-mixed-models at r-project.org> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
>