[R-sig-ME] Advice regarding model choice
Philipp Singer
killver at gmail.com
Tue Oct 27 00:15:33 CET 2015
The data I am currently studying look like the following:
Suppose that we repeatedly ask subjects to write a piece of text. We are
mainly interested in whether the repeated writing has an effect on
features of the written text.
For example, we can hypothesize that the fifth text is shorter than the
first.
To that end, the data looks like the following (based on only the text
length feature):
subject | text_length (characters) | index | total_amount
I have identified total_amount as an important feature to consider
because, e.g., text length differs between people who write the text
100 times and those who write it only 5 times; the setting is not
balanced.
Sample data for one subject could look like:
subject | text_length | index | total_amount
1 | 100 | 1 | 3
1 | 78 | 2 | 3
1 | 80 | 3 | 3
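In R terms, the structure is something like this (just a toy
reconstruction of the table above; I call the data frame d below):

## toy data frame with the same columns as the table above
d <- data.frame(
  subject      = c(1, 1, 1),
  text_length  = c(100, 78, 80),
  index        = c(1, 2, 3),
  total_amount = c(3, 3, 3)
)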
A reasonable model my experiments have suggested is the following:
text_length ~ 1 + index + total_amount + (1|subject)
Alternatively, it might also be reasonable to include (1|total_amount)
instead of incorporating total_amount as a fixed effect.
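Concretely, I fit these two variants with lme4 roughly as follows (a
sketch; d stands for the full data frame with the columns shown above):

library(lme4)

## total_amount as a fixed effect
m1 <- lmer(text_length ~ 1 + index + total_amount + (1 | subject), data = d)

## alternative: total_amount as an additional random intercept
m2 <- lmer(text_length ~ 1 + index + (1 | subject) + (1 | total_amount), data = d)

summary(m1)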
In this model, as hypothesized, the index shows a negative coefficient.
The main reason for this post, though, is that I am unsure whether I
can justify using a linear model here: the data are not normally
distributed, and neither are the residuals.
At the following link I have plotted some QQ plots with different
distribution fits (based on a large sample):
http://imgur.com/a/jinav
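Not sure whether it matters, but the residual check was essentially of
this kind (a sketch, using m1 from above):

## QQ plots of the raw response and of the model residuals
qqnorm(d$text_length); qqline(d$text_length)
qqnorm(residuals(m1)); qqline(residuals(m1))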
Usually I would handle such "count" data with a Poisson GLM, but here
it does not converge. Also, as the plots suggest, a Poisson
distribution does not seem to be a good fit, and the Poisson fit
indicates strong overdispersion.
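For completeness, the Poisson attempt and a crude overdispersion check
looked roughly like this (a sketch; the check is just the usual Pearson
chi-square over the residual degrees of freedom):

## Poisson GLMM -- this is the fit that fails to converge on the full data
m_pois <- glmer(text_length ~ 1 + index + total_amount + (1 | subject),
                family = poisson, data = d)

## crude overdispersion check: values well above 1 suggest overdispersion
sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)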
An important thing to note is that my real data set is very large
(imagine several million data points).
Do you guys have any suggestions on how to proceed?
Thanks!