[R-sig-ME] What to do with zero inflated, negative skewed, negative data: a question about GLMMs

Mon Nov 30 18:09:29 CET 2020

   I think Gabriella may have abandoned the linear mixed model (i.e. 
Gaussian distribution) because of a skewed distribution of responses.  A 
couple of things to keep in mind about this:

     - you don't need to worry about the *marginal* distribution of the 
data (i.e., what you get if you plot the histogram or density of your 
response variable). The assumptions in LMMs (like most models) are about 
the *conditional* distribution, i.e. the distribution of the residuals 
(e.g., fit your model first, then examine lattice::qqmath(fitted_model) 
or hist(residuals(fitted_model))).

     - non-normality (including skewness) even in the conditional model 
is much less important to the validity (accuracy of the parameter 
estimates, confidence intervals, etc.) than many people think

    - in principle you could transform the response variable to deal 
with this, although admittedly the choice of transformations is much 
more limited for non-positive data (e.g. Yeo-Johnson transformations, 
see `?car::yjPower`), although there are some issues here about whether 
you're transforming the marginal or the conditional distribution ...
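
   The marginal-versus-conditional point can be checked directly. Below 
is a minimal sketch on simulated data (not Gabriella's data; the column 
names sentiment, question, age and patient are borrowed from Thierry's 
example and are assumptions here). It uses nlme, which ships with R; 
lme4::lmer would work along the same lines. The Yeo-Johnson step only 
runs if car is installed, and lambda = 2 is an arbitrary illustrative 
value, not an estimate.

```r
library(nlme)

set.seed(1)
# Simulated stand-in data; all names and effect sizes are hypothetical.
n_patient <- 200
n_per     <- 3
d <- data.frame(
  patient  = factor(rep(seq_len(n_patient), each = n_per)),
  question = factor(rep(sample(c("negative", "neutral", "positive"),
                               n_patient, replace = TRUE), each = n_per)),
  age      = rep(round(runif(n_patient, 18, 70)), each = n_per)
)
# One random intercept per patient, repeated for each of their sentences
patient_effect <- rnorm(n_patient, sd = 0.3)[as.integer(d$patient)]
# Left-skewed noise (negated exponential, shifted to mean zero)
d$sentiment <- -0.2 + 0.1 * (d$question == "positive") +
  patient_effect - rexp(nrow(d), rate = 4) + 0.25

fit <- lme(sentiment ~ question + age, random = ~ 1 | patient, data = d)

# Marginal distribution: histogram of the raw response
hist(d$sentiment, main = "Marginal distribution of the response")

# Conditional distribution: residuals after fitting the model
res <- as.numeric(residuals(fit, type = "pearson"))
hist(res, main = "Conditional (Pearson) residuals")
qqnorm(res); qqline(res)

# Optional: Yeo-Johnson transform of a response that can be negative
if (requireNamespace("car", quietly = TRUE)) {
  d$sentiment_yj <- car::yjPower(d$sentiment, lambda = 2)
}
```

The residual plots, not the raw histogram, are the ones that speak to 
the model assumptions.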

   cheers

     Ben Bolker

On 11/30/20 2:50 AM, Thierry Onkelinx via R-sig-mixed-models wrote:
> Dear Gabriella,
>
> I'd try to fit a single model to the data. The response seems continuous to
> me. So I'd try a Gaussian distribution. You might need to fit a different
> variance for each of the questions.
>
> library(nlme)
> lme(sentiment ~ question + age, random = ~ 1 | patient)
> lme(sentiment ~ question + age, random = ~ 1 | patient,
>     weights = varIdent(form = ~ 1 | question))
>
> Best regards,
>
> ir. Thierry Onkelinx
> Statisticus / Statistician
>
> Vlaamse Overheid / Government of Flanders
> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
> FOREST
> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
> thierry.onkelinx using inbo.be
> Havenlaan 88 bus 73, 1000 Brussel
> www.inbo.be
>
> ///////////////////////////////////////////////////////////////////////////////////////////
> To call in the statistician after the experiment is done may be no more
> than asking him to perform a post-mortem examination: he may be able to say
> what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
> ///////////////////////////////////////////////////////////////////////////////////////////
>
> Op ma 30 nov. 2020 om 01:24 schreef Gabriella Kountourides <
> gabriella.kountourides using sjc.ox.ac.uk>:
>
>> Hello everyone,
>>
>> This is my first question to this list :) I hope this email finds you all
>> well.
>>
>>
>>    I have been struggling for the past few weeks to set an appropriate
>> model for my data. I have read Prof Bolker's practical guide for ecology
>> and evolution paper, as well as the GLMM FAQs which have been immensely
>> helpful. I am only just beginning my stats journey (and R!) and although I
>> am really enjoying it, I have found myself completely stumped with my
>> dataset. I will describe the data set below, and below that the various
>> attempts I have made to analyse it. I would be incredibly grateful to hear
>> your thoughts.
>>
>> All the very best
>>
>> Data:
>>
>>
>> I want to look at whether there is a relationship between the phrasing used
>> when a question is asked (positive, negative, neutral wording) and the
>> polarity of the response from the individual.
>>
>>
>> 2638 people were asked a question about medical symptoms.
>>
>> 1/3 of the people were asked it with a negative wording, 1/3 with a
>> neutral one, 1/3 with a positive one.
>>
>> The big question is: does the way the question is asked affect the
>> polarity of the response?
>>
>>
>> From this, I did sentiment analysis (using trinker's sentimentr package,
>> <https://github.com/trinker/sentimentr>), which provides a
>> polarity score (this can be negative, neutral or positive) to see whether
>> their responses were more positive or negative, depending on the wording of
>> the question.
>>
>>
>> Sentiment analysis breaks down responses into sentences, so I have 2638
>> people, but 7924 sentences, so I assume I should fit ID as a random effect.
>>
>>
>> Range: -4.0376 to + 0.7915.
>> Median :-0.1830
>> Mean   :-0.2149
>>
>> Mode: 0
>> skew: -1.7
>>
>> There are many 0s in my data; these are true 0s that represent a
>> 'neutral' response, which is important. My data are negatively skewed, so
>> more people answer in a negative way. But I still want to know whether the
>> phrasing affects the skew: does one phrasing lead to 'less negative'
>> responses?
>>
>> What I've tried:
>> Initially, I tried to do a glm with the raw data, but I can't use Poisson
>> as the data are negative; they are skewed, so not Gaussian; and they are
>> not binomial.
>>
>> So next I made 3 new variables, which were counts. For example, 'PosCount'
>> scored 1 for each row with a positive polarity score, and 0 if not. Likewise
>> for neutral (sentiment = 0) and negative (sentiment < 0). I decided to run a
>> zero-inflated Poisson model.
>>
>> I ran a GLMM for each count variable; for example, for the positive one:
>> posmodel <- glmmTMB(PosCount ~ wordingQ + age + (1 | id), data = allprimesent,
>>                     ziformula = ~ 1, family = poisson)
>>
>> and then the 'overdisp_fun' function, which gave
>> > overdisp_fun(posmodel)
>>        chisq        ratio          rdf            p
>> 6268.8427185    0.8295412 7557.0000000    1.0000000
>>
>> So I suppose my questions are: do you think this is the best thing to do
>> with my data? Do you know of anything better I can do with the raw data?
>> I'd rather not lose the information about the strength of the sentiment,
>> but if I keep it, I need a model that can deal with zero inflation, negative
>> skew, and negative numbers.
>>
>> Many thanks if you've read this! I look forward to hearing from you!
>> All the best
>>
>> p.s. I am relatively new to stats and R, please bear that in mind with
>> your terminology if you are kind enough to answer
>>
>>
>> Gabriella Kountourides
>>
>> DPhil Student | Department of Anthropology
>>
>> Evolutionary Medicine and Public Health Group
>>
>> St. John’s College, University of Oxford
>>
>> gabriella.kountourides using sjc.ox.ac.uk
>>
>> Tweet me: https://twitter.com/GKountourides
>>
>> ________________________________
>>
>>
>>
>>
>> _______________________________________________
>> R-sig-mixed-models using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>