[R-sig-ME] What to do with zero inflated, negative skewed, negative data: a question about GLMMs

Mon Nov 30 03:16:29 CET 2020

Hi Gabriella,

I'm not sure you really have zero inflation here. Zero inflation usually arises when you have counts or a yes/no response, and a 0 means a 'lack of a response'. One way to think of this is that the 0s represent cases where the 'thing' you are measuring didn't occur, and the count applies when it did. For example, if you took random samples from all over the planet, you wouldn't find fish in the desert, but you would find them in water. And where you do find them, there are lots of different variables that might affect how many you see. So you would fit 2 models:
1) model 1: are fish present i.e. is the sample a water or land sample?
2) model 2: how many fish did you find, assuming they were there.  
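
As a rough illustration of the two-model idea, here is a minimal base-R sketch on simulated data. The variable names (`wetness`, `temperature`, `fish`) and all the numbers are made up for the example. The second model is an ordinary Poisson fit to the non-zero counts, used as a simple stand-in for a proper zero-truncated count model.

```r
set.seed(1)
n <- 500

# Hypothetical covariates: wetness drives presence, temperature drives abundance
wetness     <- runif(n)
temperature <- rnorm(n, mean = 15, sd = 5)

# Simulate presence, then a strictly positive count wherever fish are present
present <- rbinom(n, 1, plogis(-2 + 5 * wetness))
fish    <- present * (1 + rpois(n, exp(0.5 + 0.05 * temperature)))

samples <- data.frame(wetness, temperature, fish,
                      present = as.integer(fish > 0))

# Model 1: are fish present at all?
m_presence <- glm(present ~ wetness, family = binomial, data = samples)

# Model 2: how many fish, given that some were found?
# (plain Poisson on the positive counts; an approximation, see the note above)
m_count <- glm(fish ~ temperature, family = poisson,
               data = subset(samples, fish > 0))

print(coef(m_presence))
print(coef(m_count))
```

The point of the split is that the zeros and the positive counts can be driven by different predictors, which is why each part gets its own formula.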

So in your example the 0s don't really mean an absence of data; they mean a neutral score.

One way to analyse this is to use at least 2 logistic regressions: one that explains the difference between negative and neutral, and another for neutral vs positive. You might also want a 3rd model that compares negative vs positive.
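
A minimal sketch of what those pairwise logistic regressions could look like, using simulated data and base R only. The column names (`wording`, `polarity`, `category`) and the simulated effect sizes are assumptions made for the illustration, and the sketch ignores the repeated sentences per person.

```r
set.seed(2)
n <- 900

# Hypothetical data: question wording and a polarity score per sentence
wording <- factor(sample(c("negative", "neutral", "positive"), n, replace = TRUE),
                  levels = c("neutral", "negative", "positive"))
shift    <- c(neutral = 0, negative = -0.3, positive = 0.2)
polarity <- round(rnorm(n, mean = unname(shift[as.character(wording)]), sd = 0.4), 1)

d <- data.frame(wording, polarity)
# Classify each score as negative, neutral (exactly 0) or positive
d$category <- c("negative", "neutral", "positive")[sign(d$polarity) + 2]

# Regression 1: negative vs neutral (positive sentences dropped)
neg_vs_neu <- subset(d, category != "positive")
neg_vs_neu$is_negative <- as.integer(neg_vs_neu$category == "negative")
m_neg <- glm(is_negative ~ wording, family = binomial, data = neg_vs_neu)

# Regression 2: positive vs neutral (negative sentences dropped)
pos_vs_neu <- subset(d, category != "negative")
pos_vs_neu$is_positive <- as.integer(pos_vs_neu$category == "positive")
m_pos <- glm(is_positive ~ wording, family = binomial, data = pos_vs_neu)

# A third model (negative vs positive) would follow the same pattern.
print(coef(summary(m_neg)))
print(coef(summary(m_pos)))
```

With several sentences per person, a random intercept for the person could be added by moving to a mixed model, for example `lme4::glmer(is_negative ~ wording + (1 | id), family = binomial, ...)`, assuming an `id` column is available.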

Chris Howden B.Sc. (Hons)
Founding Partner
Data Analysis, Modelling and Training
Evidence Based Strategy/Policy Development, IP Commercialisation and Innovation
(mobile) +61 (0) 410 689 945 | (skype) chris using trickysolutions.com.au

-----Original Message-----
From: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org> On Behalf Of Gabriella Kountourides
Sent: Monday, 30 November 2020 11:24 AM
To: r-sig-mixed-models using r-project.org
Subject: [R-sig-ME] What to do with zero inflated, negative skewed, negative data: a question about GLMMs

Hello everyone,

This is my first question to this list :) I hope this email finds you all well.

I have been struggling for the past few weeks to set an appropriate model for my data. I have read Prof Bolker's practical guide for ecology and evolution paper, as well as the GLMM FAQs which have been immensely helpful. I am only just beginning my stats journey (and R!) and although I am really enjoying it, I have found myself completely stumped with my dataset. I will describe the data set below, and below that the various attempts I have made to analyse it. I would be incredibly grateful to hear your thoughts.

All the very best

Data:

I want to look at whether there is a relationship between the phrasing used when a question is asked (positive, negative, neutral wording) and the polarity of the response from the individual.

2638 people were asked a question about medical symptoms.

1/3 of the people were asked it with a negative wording, 1/3 with a neutral one, 1/3 with a positive one.

The big question is: does the way the question is asked affect the polarity of the response?

From this, I did sentiment analysis (using trinker's sentimentr package <https://github.com/trinker/sentimentr>), which provides a polarity score (this can be negative, neutral or positive) to see whether their responses were more positive or negative, depending on the wording of the question.

Sentiment analysis breaks down responses into sentences, so I have 2638 people, but 7924 sentences, so I assume I should fit ID as a random effect.

Range: -4.0376 to +0.7915
Median: -0.1830
Mean: -0.2149

Mode: 0
Skew: -1.7

There are many 0s in my data. These are true 0s: they represent a 'neutral' response, which is important. My data are negatively skewed, so more people answer in a negative way. But I still want to know whether the phrasing affects the skew: is one phrasing leading to 'less negative' responses?

What I've tried:
Initially, I tried to fit a GLM to the raw data, but I can't use Poisson because the data include negative values, they are skewed so they're not Gaussian, and they're not binomial.

So next I made 3 new variables, which were counts. For example, 'PosCount' scored 1 for each row with a positive polarity score (sentiment>0), and 0 if not. Likewise for neutral (sentiment=0) and negative (sentiment<0). I then decided to run a zero-inflated Poisson model.
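
A minimal sketch of how such indicator variables could be built from the polarity score, assuming the third indicator is meant to flag negative scores (sentiment<0). The toy data frame and the names `NeuCount` and `NegCount` are hypothetical; only `PosCount` appears in the message.

```r
# Hypothetical polarity scores, one row per sentence
sent <- data.frame(id        = c(1, 1, 2, 3, 3, 3),
                   sentiment = c(-0.50, 0, 0.25, -1.20, 0, 0.10))

# Each indicator is 1 when the sentence falls in that class and 0 otherwise
sent$PosCount <- as.integer(sent$sentiment > 0)
sent$NeuCount <- as.integer(sent$sentiment == 0)
sent$NegCount <- as.integer(sent$sentiment < 0)

print(sent)
```

Each row has exactly one indicator equal to 1, so these variables only ever take the values 0 and 1 and carry no information about how strongly positive or negative a sentence is.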

I ran a GLMM for each count variable; for example, for the positive one:
posmodel <- glmmTMB(PosCount ~ wordingQ + age + (1 | id), data = allprimesent, ziformula = ~1, family = poisson)

and then the 'overdisp_fun' function which gave
> overdisp_fun(posmodel)
       chisq        ratio          rdf            p
6268.8427185    0.8295412 7557.0000000    1.0000000
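
For reference, here is a sketch of an `overdisp_fun`-style check, written along the lines of the helper in the GLMM FAQ (an assumption about which version was used). It is demonstrated on a plain Poisson `glm` with simulated data rather than on the glmmTMB fit. The ratio it reports is the sum of squared Pearson residuals divided by the residual degrees of freedom.

```r
# Overdispersion check: Pearson chi-square relative to residual degrees of freedom
overdisp_fun <- function(model) {
  rdf   <- df.residual(model)
  rp    <- residuals(model, type = "pearson")
  chisq <- sum(rp^2)
  c(chisq = chisq,
    ratio = chisq / rdf,
    rdf   = rdf,
    p     = pchisq(chisq, df = rdf, lower.tail = FALSE))
}

# Simulated Poisson data, purely to show the function running
set.seed(3)
x   <- rnorm(200)
y   <- rpois(200, exp(0.3 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)

res <- overdisp_fun(fit)
print(res)

# The same arithmetic applied to the reported output: chisq / rdf
print(6268.8427185 / 7557)
```

A ratio well above 1 with a small p-value would point to overdispersion; a ratio below 1 with p = 1, as reported here, gives no sign of it.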

So I suppose my questions are: do you think this is the best thing to do with my data? Is there anything better I can do with the raw data? I'd rather not lose the information about the strength of the sentiment, but if I keep it, I need a model that can deal with 0 inflation, negative skew, and negative numbers.

Many thanks if you've read this! I look forward to hearing from you!
All the best

p.s. I am relatively new to stats and R, please bear that in mind with your terminology if you are kind enough to answer

Gabriella Kountourides

DPhil Student | Department of Anthropology

Evolutionary Medicine and Public Health Group

St. John's College, University of Oxford

gabriella.kountourides using sjc.ox.ac.uk

Tweet me: https://twitter.com/GKountourides
