[R-sig-ME] Advice regarding model choice
Philipp Singer
killver at gmail.com
Tue Oct 27 02:52:52 CET 2015
Dear David,
Thanks a lot for your response.
Actually, I have already tried the first solution you suggested, adding
an observation-level random effect. Unfortunately, it ends with an error
that I have not yet found a solution for:
Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in
pwrssUpdate
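For reference, here is a minimal sketch of the kind of observation-level
random effect (OLRE) model I mean, with simulated stand-in data (the real
set is far too large to post); the control settings are common workarounds
for PIRLS step-halving failures, though they are no guarantee:

```r
library(lme4)

## Hypothetical stand-in data: names follow the thread, values simulated
set.seed(1)
d <- data.frame(subject = factor(rep(1:50, each = 10)),
                index   = rep(1:10, times = 50))
d$text_length <- rpois(nrow(d), lambda = exp(4 - 0.05 * d$index))

## One random-effect level per observation absorbs overdispersion
d$obs <- factor(seq_len(nrow(d)))

## nAGQ = 0 (cheaper, approximate) plus an alternative optimizer
## sometimes gets past PIRLS step-halving failures
m_olre <- glmer(text_length ~ index + (1 | subject) + (1 | obs),
                family = poisson, data = d, nAGQ = 0,
                control = glmerControl(optimizer = "bobyqa"))
```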
My full data set is also really large, roughly 35 million
observations... I could work with subsamples, though.
In general, though, I am not sure the effort is even worth it, given the
qqplots I attached to my previous mail: neither a Poisson nor a negative
binomial model seems to fit the data well.
Cheers,
Philipp
P.S.: David accidentally replied only to me, which is why I am
continuing the discussion here.
On 10/26/2015 04:55 PM, David Jones wrote:
> Dear Philipp - I recently had a similar situation, and here is my 2c.
>
> Regarding the model, two common ways to account for overdispersion
> are negative binomial and Poisson-lognormal models (for a
> Poisson-lognormal model that is easy to implement in lme4, see the
> code that accompanies a Ben Bolker manuscript at
> https://blogs.umass.edu/nrc697sa-finnj/2012/11/08/bolkers-reanalysis-of-owl-data/).
> They will probably run very slowly on a sample of that size: a
> negative binomial model on a dataset of N = 600k recently took me
> about two hours, even on Amazon EC2. You may want to use the verbose
> argument and the system.time() function to confirm that the fit is
> making progress and to record how long it took once it completes.
> These models took much longer to run than a regular Poisson on my
> machines.
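[A sketch of the timing/progress advice above, with small simulated
data (hypothetical names; the real fits would take far longer): wrap the
fit in system.time() and pass verbose = TRUE so the optimizer prints
progress as it iterates.]

```r
library(lme4)

## Hypothetical simulated negative-binomial counts
set.seed(2)
d <- data.frame(subject = factor(rep(1:40, each = 10)),
                index   = rep(1:10, times = 40))
d$text_length <- rnbinom(nrow(d), mu = exp(4 - 0.05 * d$index), size = 2)

## system.time() records the cost; verbose = TRUE prints optimizer
## progress, so you can tell a long-running fit is still moving
t_nb <- system.time(
  m_nb <- glmer.nb(text_length ~ index + (1 | subject),
                   data = d, verbose = TRUE)
)
t_nb["elapsed"]  # seconds spent fitting
```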
>
> For the convergence issues, I found the following site very helpful:
> https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html
> In particular, recoding predictors helped (mine were categorical, and
> changing the coding scheme helped some), and using the estimates from
> a first model run as starting values for a second run was enough to
> eliminate the warnings (and at times, using the second run's estimates
> to start a third). Admittedly, these issues may disappear or change
> when you change modeling approaches.
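[The restart trick above can be sketched as follows; the data and model
are hypothetical, but getME()/update() with a start list is the usual
lme4 recipe for refitting from a previous fit's estimates.]

```r
library(lme4)

## Hypothetical simulated data for illustration
set.seed(3)
d <- data.frame(subject = factor(rep(1:30, each = 10)),
                x = rnorm(300))
d$y <- rpois(300, lambda = exp(1 + 0.3 * d$x))

m1 <- glmer(y ~ x + (1 | subject), family = poisson, data = d)

## Extract the first fit's parameters and restart from them; this
## often clears spurious convergence warnings
ss <- getME(m1, c("theta", "fixef"))
m2 <- update(m1, start = ss,
             control = glmerControl(optCtrl = list(maxfun = 2e4)))
```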
>
> On Mon, Oct 26, 2015 at 7:15 PM, Philipp Singer <killver at gmail.com
> <mailto:killver at gmail.com>> wrote:
>
> My current data to study looks like the following:
>
> Suppose that we repeatedly let subjects write a piece of text. We
> are now mainly interested in whether the consecutive writing has
> an effect on features of the written text.
>
> For example, we can hypothesize that the fifth text is shorter
> than the first.
>
> To that end, the data looks like the following (based on only the
> text length feature):
>
> subject | text_length (characters) | index | total_amount
>
> I have identified total_amount as an important feature to consider:
> text length differs between, e.g., people who write the text 100
> times and those who write it only 5 times; the setting is not
> balanced.
>
> Sample data for one subject could look like:
> subject | text_length | index | total_amount
> 1 | 100 | 1 | 3
> 1 | 78 | 2 | 3
> 1 | 80 | 3 | 3
>
> A reasonable model my experiments have suggested is the following:
>
> text_length ~ 1 + index + total_amount + (1|subject)
>
> Alternatively, it might also be reasonable to add (1|total_amount)
> instead of incorporating it as a fixed effect.
>
> In this model, as hypothesized, the index shows a negative
> coefficient.
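[The two specifications above can be sketched in lme4 as follows; the
data is a simulated, hypothetical stand-in with the structure described
in the post. Note that with only a few distinct total_amount values, the
(1|total_amount) variant estimates a variance component from very few
levels.]

```r
library(lme4)

## Hypothetical stand-in for the real data (structure from the post)
set.seed(4)
totals <- sample(c(5, 20, 100), 40, replace = TRUE)
d <- do.call(rbind, lapply(seq_along(totals), function(s)
  data.frame(subject      = factor(s),
             index        = seq_len(totals[s]),
             total_amount = totals[s])))
d$text_length <- round(100 - 3 * log(d$index) + rnorm(nrow(d), sd = 10))

## total_amount as a fixed effect ...
m_fixed <- lmer(text_length ~ 1 + index + total_amount + (1 | subject),
                data = d)

## ... or as a grouping factor (here only 3 levels, which is few for
## estimating a variance component)
m_rand <- lmer(text_length ~ 1 + index + (1 | total_amount) + (1 | subject),
               data = d)

fixef(m_fixed)["index"]  # estimated index effect
```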
>
> The main reason for this post, though, is that I am unsure whether I
> can justify using a linear model here: neither the data nor the
> residuals are normally distributed.
>
> In the following, I have plotted some qqplots with different fits
> (based on a large sample).
>
> http://imgur.com/a/jinav
>
> Usually, I would proceed with such "count" data by using a Poisson
> GLM, but it does not converge. Also, as the plots suggest, a Poisson
> distribution does not seem to be a good fit here; the Poisson fit
> also indicates strong overdispersion.
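[A quick numeric check for the overdispersion mentioned above, using
only base R and hypothetical simulated counts: the Pearson chi-square
statistic divided by the residual degrees of freedom should be near 1
for a well-fitting Poisson model, and well above 1 under
overdispersion.]

```r
## Hypothetical overdispersed counts (the real data is much larger)
set.seed(5)
n <- 2000
index <- sample(1:100, n, replace = TRUE)
## Negative-binomial draws have variance > mean, i.e. overdispersion
y <- rnbinom(n, mu = exp(3 - 0.01 * index), size = 1)

m_pois <- glm(y ~ index, family = poisson)

## Pearson chi-square over residual df: ~1 for a good Poisson fit,
## values well above 1 indicate overdispersion
disp <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
disp
```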
>
> An important thing to note here is that my real data is very, very
> large (multiple millions of data points).
>
> Do you guys have any suggestions on how to proceed?
>
> Thanks!
>
> _______________________________________________
> R-sig-mixed-models at r-project.org
> <mailto:R-sig-mixed-models at r-project.org> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
>