[R-sig-ME] What to do with zero inflated, negative skewed, negative data: a question about GLMMs

Mon Nov 30 01:23:48 CET 2020

Hello everyone,

This is my first question to this list :) I hope this email finds you all well.

I have been struggling for the past few weeks to find an appropriate model for my data. I have read Prof Bolker's practical guide for ecology and evolution paper, as well as the GLMM FAQ, both of which have been immensely helpful. I am only just beginning my stats (and R!) journey, and although I am really enjoying it, I am completely stumped by my dataset. I describe the data set below, followed by the various attempts I have made to analyse it. I would be incredibly grateful to hear your thoughts.

All the very best

Data:

I want to look at whether there is a relationship between the phrasing used when a question is asked (positive, negative, or neutral wording) and the polarity of the individual's response.

2638 people were asked a question about medical symptoms.

1/3 of the people were asked it with a negative wording, 1/3 with a neutral one, 1/3 with a positive one.

The big question is: does the way the question is asked affect the polarity of the response?

From this, I did sentiment analysis (using trinker's sentimentr package, https://github.com/trinker/sentimentr), which provides a polarity score (negative, neutral or positive), to see whether the responses were more positive or more negative depending on the wording of the question.

Sentiment analysis breaks responses down into sentences, so I have 2638 people but 7924 sentences. I therefore assume I should fit ID as a random effect.

Range:  -4.0376 to +0.7915
Median: -0.1830
Mean:   -0.2149
Mode:    0
Skew:   -1.7
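
For reference, summaries like those above can be computed in base R roughly as follows. This is only a sketch on simulated numbers: the vector `polarity` and its values are stand-ins, not the real data, and `skewness` is a simple moment-based helper written here for illustration, so it may differ slightly from what a package reports.

```r
# Simulated stand-in for the sentence-level polarity scores (NOT the real data)
set.seed(1)
polarity <- c(rep(0, 300), -rexp(600, rate = 3), rexp(100, rate = 8))

# Range, median, and mean
range(polarity)
median(polarity)
mean(polarity)

# Mode: the most frequent value (rounded so near-identical scores are grouped)
tab <- table(round(polarity, 4))
mode_value <- as.numeric(names(tab)[which.max(tab)])

# Moment-based skewness: negative values indicate a long left tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
}
skew_value <- skewness(polarity)
```

With many exact zeros and a long tail of negative scores, the mode comes out as 0 and the skewness is negative, which is the pattern described here.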

There are many 0s in my data. These are true 0s: they represent a 'neutral' response, which is important. My data are negatively skewed, so more people answer in a negative way. But I still want to know whether the phrasing affects the skew, i.e. does one phrasing lead to 'less negative' responses?

What I've tried:
Initially, I tried to fit a GLM to the raw data, but I can't use Poisson because the data contain negative values, they are skewed so they're not Gaussian, and they're not binomial.

So next I made 3 new variables, which were counts. For example, 'PosCount' scored 1 for each row with a positive polarity score (sentiment>0), and 0 if not. Likewise for neutral (sentiment=0) and negative (sentiment<0). I then decided to run a zero-inflated Poisson model.
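
A minimal sketch of this kind of recoding, on a tiny made-up data frame. The column name `sentiment`, the data frame `sentences`, and the names `NegCount` and `NeuCount` are hypothetical and chosen only for illustration; `PosCount` is the only name taken from the model below.

```r
# Hypothetical sentence-level data: one row per sentence, several rows per person
sentences <- data.frame(
  id        = c(1, 1, 2, 3, 3, 3),
  sentiment = c(-0.50, 0, 0.25, -1.20, 0, -0.10)
)

# One 0/1 indicator per polarity class
sentences$NegCount <- as.integer(sentences$sentiment < 0)
sentences$NeuCount <- as.integer(sentences$sentiment == 0)
sentences$PosCount <- as.integer(sentences$sentiment > 0)

# Each sentence falls in exactly one class, so the indicators sum to 1 per row
row_totals <- rowSums(sentences[, c("NegCount", "NeuCount", "PosCount")])

# Number of sentences in each class
class_totals <- colSums(sentences[, c("NegCount", "NeuCount", "PosCount")])
```

Coded this way, each variable takes only the values 0 and 1 within a row, and the strength of the sentiment is no longer retained.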

I ran a GLMM for each count variable. For example, for the positive one:
library(glmmTMB)
posmodel <- glmmTMB(PosCount ~ wordingQ + age + (1 | id), data = allprimesent, ziformula = ~1, family = poisson)

and then ran the 'overdisp_fun' function on it, which gave:
> overdisp_fun(posmodel)
       chisq        ratio          rdf            p
6268.8427185    0.8295412 7557.0000000    1.0000000
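
For anyone not familiar with it, `overdisp_fun` is a helper given in the GLMM FAQ. The sketch below reproduces it as I understand it (it may not match the FAQ's current version exactly) and applies it to an ordinary Poisson `glm` on simulated data, not to my model, so that it runs on its own. The data frame `d` and the object `fit` are made up for illustration.

```r
# Overdispersion check based on Pearson residuals (sketch after the GLMM FAQ helper)
overdisp_fun <- function(model) {
  rdf <- df.residual(model)                  # residual degrees of freedom
  rp <- residuals(model, type = "pearson")   # Pearson residuals
  Pearson.chisq <- sum(rp^2)                 # Pearson chi-square statistic
  prat <- Pearson.chisq / rdf                # dispersion ratio
  pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
  c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}

# Simulated Poisson counts (NOT the real data), fitted with an ordinary glm
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- rpois(200, lambda = exp(0.2 + 0.3 * d$x))
fit <- glm(y ~ x, family = poisson, data = d)
res <- overdisp_fun(fit)
res
```

The ratio is the Pearson chi-square divided by the residual degrees of freedom. In the output reported for my model, 6268.84 / 7557 is about 0.83, i.e. a ratio below 1.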

So I suppose my questions are: do you think this is the best thing to do with my data? Do you know of anything better I can do with the raw data? I'd rather not lose the information about the strength of the sentiment, but if I keep it, I need a model that can deal with zero inflation, negative skew, and negative numbers.

Many thanks if you've read this! I look forward to hearing from you!
All the best

P.S. I am relatively new to stats and R, so please bear that in mind with your terminology if you are kind enough to answer.

Gabriella Kountourides

DPhil Student | Department of Anthropology

Evolutionary Medicine and Public Health Group

St. John's College, University of Oxford

gabriella.kountourides using sjc.ox.ac.uk

Tweet me: https://twitter.com/GKountourides
