[R-sig-ME] Advice regarding model choice
Philipp Singer
killver at gmail.com
Tue Oct 27 00:15:33 CET 2015
The data I am currently studying look like the following:
Suppose that we repeatedly ask subjects to write a piece of text. We are
mainly interested in whether the repeated writing has an effect on
features of the written text.
For example, we can hypothesize that the fifth text is shorter than the
first.
To that end, the data looks like the following (based on only the text
length feature):
subject | text_length (characters) | index | total_amount
I have identified total_amount as an important feature to consider
because, e.g., text length differs between people who write the text
100 times and those who write it only 5 times; the setting is not
balanced.
Sample data for one subject could look like:
subject | text_length | index | total_amount
1 | 100 | 1 | 3
1 | 78 | 2 | 3
1 | 80 | 3 | 3
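In R terms, the structure is something like this (just a toy
reconstruction of the table above; I call the data frame d below):

## toy data frame with the same columns as the table above
d <- data.frame(
  subject      = c(1, 1, 1),
  text_length  = c(100, 78, 80),
  index        = c(1, 2, 3),
  total_amount = c(3, 3, 3)
)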
A reasonable model my experiments have suggested is the following:
text_length ~ 1 + index + total_amount + (1|subject)
Alternatively, it might also be reasonable to include (1|total_amount)
instead of incorporating total_amount as a fixed effect.
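Concretely, I fit these two variants with lme4 roughly as follows (a
sketch; d stands for the full data frame with the columns shown above):

library(lme4)

## total_amount as a fixed effect
m1 <- lmer(text_length ~ 1 + index + total_amount + (1 | subject), data = d)

## alternative: total_amount as an additional random intercept
m2 <- lmer(text_length ~ 1 + index + (1 | subject) + (1 | total_amount), data = d)

summary(m1)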
In this model, as hypothesized, the index shows a negative coefficient.
The main reason for this post, though, is that I am unsure whether I
can justify using a linear model here: the data are not normally
distributed, and neither are the residuals.
At the following link I have plotted some QQ plots with different
distribution fits (based on a large sample):
http://imgur.com/a/jinav
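Not sure whether it matters, but the residual check was essentially of
this kind (a sketch, using m1 from above):

## QQ plots of the raw response and of the model residuals
qqnorm(d$text_length); qqline(d$text_length)
qqnorm(residuals(m1)); qqline(residuals(m1))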
Usually I would handle such "count" data with a Poisson GLM, but here
it does not converge. Also, as the plots suggest, a Poisson
distribution does not seem to be a good fit, and the Poisson fit
indicates strong overdispersion.
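For completeness, the Poisson attempt and a crude overdispersion check
looked roughly like this (a sketch; the check is just the usual Pearson
chi-square over the residual degrees of freedom):

## Poisson GLMM -- this is the fit that fails to converge on the full data
m_pois <- glmer(text_length ~ 1 + index + total_amount + (1 | subject),
                family = poisson, data = d)

## crude overdispersion check: values well above 1 suggest overdispersion
sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)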
An important thing to note is that my real data set is very large
(imagine several million data points).
Do you guys have any suggestions on how to proceed?
Thanks!