[R-sig-ME] Is it ok to use lmer() for an ordered categorical (5 levels) response variable?

Pierce, Steven pierces1 using msu.edu
Wed Mar 6 17:04:09 CET 2019

@Harold. There is a vast array of different ways that people analyze survey data. Some are far more appropriate and methodologically sound than others. Taking unweighted sums of binary items is indeed a very common practice for producing scale scores, though it is usually done without considering or testing the underlying measurement assumptions that it implies. I didn't endorse that overly simplistic practice but I have seen huge numbers of published papers that use it. More rigorous methods (such as I suggested) are reasonable ways to use observed responses to measure higher-order theoretical constructs while being thoughtful about the assumptions involved. My point was that one should be more thoughtful about the measurement process that yields the scores to be analyzed. A different approach to that step can change whether the score obtained is ordinal/count or legitimately continuous. 


-----Original Message-----
From: Doran, Harold <HDoran using air.org> 
Sent: Wednesday, March 6, 2019 10:12 AM
To: Pierce, Steven <pierces1 using msu.edu>; Nicolas Deguines <nicodeguines using gmail.com>
Cc: r-sig-mixed-models <r-sig-mixed-models using r-project.org>
Subject: RE: [R-sig-ME] Is it ok to use lmer() for an ordered categorical (5 levels) response variable?

@steven, no, this is not at all how survey researchers generate scaled scores. Scaled scores are MLEs of a likelihood function (or sometimes posterior mean/modes) constructed from the observed responses. 

To the OP, using lmer for the ordered response model is not an appropriate modeling strategy at all. Take a look at your predicted values. Are any of them < 0 or > 4, i.e. outside the 0 to 4 range of the score? What does that mean?
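
A minimal sketch of that check, assuming lme4 is installed. The data are constructed purely for illustration (d, id, x, and score are hypothetical names, not the OP's variables): a 0 to 4 score is built deterministically from a covariate, fitted with lmer(), and the fitted values are compared with the range the score can actually take.

```r
library(lme4)

# Hypothetical data: 10 participants, 13 observations each, no random numbers
d <- expand.grid(x = seq(-3, 3, by = 0.5), id = factor(1:10))
shift <- seq(-0.9, 0.9, length.out = 10)[as.integer(d$id)]  # participant offset

# Score is a rounded linear trend clipped to the allowed values 0..4
d$score <- pmin(4, pmax(0, round(2 + 1.5 * d$x + shift)))

# Treat the ordinal score as if it were continuous
fit <- lmer(score ~ x + (1 | id), data = d)

# Fitted values, including the participant-level random intercepts
pred <- predict(fit)
range(d$score)  # observed scores stay within 0..4
range(pred)     # fitted values are not constrained to that interval

# Share of fitted values that fall outside the possible range of the score
out_of_range <- mean(pred < 0 | pred > 4)
out_of_range
```

The same predict() call can be run on a real lmer() fit; a non-zero share means the linear model assigns expected scores that the response cannot take.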

-----Original Message-----
From: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org> On Behalf Of Pierce, Steven
Sent: Wednesday, March 06, 2019 9:38 AM
To: Nicolas Deguines <nicodeguines using gmail.com>
Cc: r-sig-mixed-models <r-sig-mixed-models using r-project.org>
Subject: Re: [R-sig-ME] Is it ok to use lmer() for an ordered categorical (5 levels) response variable?


So your score is constructed by summing 4 binary variables, each of which represents a feature of a natural environment. This is similar to how social science researchers use survey data with binary questions that get combined into scale scores. Such scale scores are usually intended to measure latent variables. 

By taking a simple sum of the binary items as the scale score, you are making strong assumptions about each item being equally important to measuring the latent variable. That's a testable hypothesis. Consider doing a confirmatory factor analysis using the 4 items as indicators of a latent variable (naturalness) that is assumed to be continuous and normally distributed. The point of this is to get a decent measurement model for the latent variable. 

Allow the factor loadings to vary across items (some may be more important to measuring the latent variable than others); also test whether there are correlations between the item residuals that are not explained by their being indicators of the latent variable. If you can achieve adequate CFA model fit and composite reliability for the latent variable (i.e., naturalness), then you can save out factor score estimates of the latent variable, which should be continuous, normally distributed values. 

Then you can either take those factor score estimates and plug them into your mixed model, or just expand from a confirmatory factor analysis model into a full structural equation model that includes all the elements of the desired mixed model, with the latent "naturalness" variable as the outcome. The latter approach is more theoretically desirable because it reduces bias due to measurement error. 
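
A rough sketch of the first route (CFA, then factor scores), assuming the lavaan package. Everything here is illustrative: the data are simulated, the item names (fallow, nettles, ivy, brambles) simply mirror the four features, and the repeated-measures structure of the real data is ignored.

```r
library(lavaan)

set.seed(1)
n <- 1000

# Hypothetical data: one continuous latent trait drives four binary features
eta <- rnorm(n)
make_item <- function(loading, threshold) {
  as.integer(loading * eta + rnorm(n, sd = sqrt(1 - loading^2)) > threshold)
}
d <- data.frame(
  fallow   = make_item(0.8,  0.0),
  nettles  = make_item(0.7,  0.3),
  ivy      = make_item(0.6, -0.3),
  brambles = make_item(0.5,  0.5)
)

# One-factor measurement model with freely estimated loadings
model <- "naturalness =~ fallow + nettles + ivy + brambles"

# Declaring the items as ordered makes lavaan treat them as categorical;
# std.lv = TRUE identifies the model by fixing the latent variance to 1
fit <- cfa(model, data = d, ordered = names(d), std.lv = TRUE)
summary(fit, fit.measures = TRUE, standardized = TRUE)

# Factor score estimates, one row per observation
scores <- lavPredict(fit)
head(scores)
```

Note that four binary indicators allow only 16 distinct response patterns, so the factor score estimates can take at most 16 distinct values; it is worth checking their distribution before treating them as continuous in a mixed model.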

Steven J. Pierce, Ph.D.
Acting Director; Associate Director
Center for Statistical Training & Consulting (CSTAT) Michigan State University
E-mail: pierces1 using msu

-----Original Message-----
From: Nicolas Deguines <nicodeguines using gmail.com>
Sent: Wednesday, March 6, 2019 9:01 AM
To: r-sig-mixed-models using r-project.org
Subject: Re: [R-sig-ME] Is it ok to use lmer() for an ordered categorical (5 levels) response variable?

Hello Phillip and all,

Thanks a lot Phillip for your very interesting and useful answer, and for the paper from Liddell & Kruschke. It helps a lot.

About trying other link and threshold functions in clmm: no huge difference in my case unfortunately. I tried different combinations of each.
'equidistant' did do better, but the improvement was far from enough.
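
For readers who want to try the same comparison, here is a sketch assuming the ordinal package. The data are simulated and the names (d, id, x, score) are hypothetical; only the link and threshold arguments of clmm() are the point.

```r
library(ordinal)

set.seed(42)

# Hypothetical repeated-measures data: 100 participants, 6 observations each
n_id <- 100
n_obs <- 6
d <- data.frame(
  id = factor(rep(seq_len(n_id), each = n_obs)),
  x  = rnorm(n_id * n_obs)
)
u <- rnorm(n_id)  # participant-level random intercepts
latent <- 0.8 * d$x + u[as.integer(d$id)] + rlogis(nrow(d))

# Cut the latent variable into an ordered 5-level score (0..4)
d$score <- cut(latent, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf),
               labels = 0:4, ordered_result = TRUE)

# All combinations of two links and two threshold structures
grid <- expand.grid(link = c("logit", "probit"),
                    threshold = c("flexible", "equidistant"),
                    stringsAsFactors = FALSE)

# Fit each combination and record its AIC (lower is better)
grid$AIC <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- clmm(score ~ x + (1 | id), data = d,
              link = grid$link[i], threshold = grid$threshold[i])
  grid$AIC[i] <- AIC(fit)
}
grid
```

The summary() of each fit also reports the condition number of the Hessian, which is the quantity that was problematic in my case.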

I computed density plots for my response variable as observed and as predicted from my lmer() model (similar to what Liddell and Kruschke do in Figure 6): the linear mixed model does pretty well in fitting the data.
=> so I'd be inclined to trust the results from my lmer models in the present case (but Liddell and Kruschke did show very clear cases where a linear model fit the ordinal data very poorly).

Meanwhile, I thought of another alternative for analyzing this response variable and I would be curious to read what people may think about it.
Before presenting that alternative, I need to say more about that 5-level response variable.
It is a score built by Muratet and Fontaine (2015)* to assess the naturalness of a given private backyard (it is shown to be correlated with higher abundance of butterflies).
In the backyard: fallow area, nettles (*Urtica dioica*), ivy (*Hedera helix*), and brambles (*Rubus spp.*) are each scored one if present, and the naturalness index was computed as the sum of these scores.
=> it results in a 5-level ordinal variable because it can go from 0 to 4, and each increase of 1 means a backyard with more features of 'naturalness'.
I thus wonder whether this could be modelled using glmer() with family = binomial, feeding the model a two-column response: cbind(sum of 1's, sum of 0's) (see the R documentation for family{stats}, in the Details: "*As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.*"). I will try it and see how well the model fits the data, but I would be interested in getting a theoretical opinion.
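
A minimal sketch of what I have in mind, using lme4 and simulated data (d, id, year, and score are hypothetical names). It treats the score as the number of "successes" out of 4 trials, which assumes the four features are interchangeable and, given the random effects, independent with a common probability of being present.

```r
library(lme4)

set.seed(123)

# Hypothetical data: 150 participants observed in 4 years each
n_id <- 150
n_year <- 4
d <- data.frame(
  id   = factor(rep(seq_len(n_id), each = n_year)),
  year = rep(0:(n_year - 1), times = n_id)
)
u <- rnorm(n_id, sd = 0.8)  # participant-level random intercepts

# Each of the 4 features is present with the same probability p
p <- plogis(-0.5 + 0.3 * d$year + u[as.integer(d$id)])
d$score <- rbinom(nrow(d), size = 4, prob = p)  # naturalness score, 0..4

# Successes = features present, failures = features absent
fit <- glmer(cbind(score, 4 - score) ~ year + (1 | id),
             data = d, family = binomial)
summary(fit)

# Expected score on the original 0..4 scale
d$expected <- 4 * fitted(fit)
range(d$expected)
```

Unlike the linear model, the expected score here cannot leave the 0 to 4 interval, because fitted() returns a probability between 0 and 1.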

I hope this can help others too

Best regards,
Nicolas Deguines


Postdoctoral Research Associate
Laboratoire Ecologie, Systématique et Evolution Université Paris Sud, Orsay, France
Website: http://nicolasdeguines.weebly.com/

On Tue, 5 Mar 2019 at 13:04, Phillip Alday <phillip.alday using mpi.nl> wrote:

> Hi Nicolas,
> How much you can get away with bending the assumptions depends in some
> ways on how well the resulting model fits your data. If the resulting
> model is a poor fit, then it's not a great model for performing
> inference. The other problem with bending assumptions is that a lot of
> 'error statistics' (standard errors, t-values, and basically anything
> related to significance testing) aren't guaranteed to do what they are
> supposed to do. (In your case, the good behavior of your residuals
> suggests that this won't be a huge problem, but there are no
> promises.)
> You can get around this a bit by doing things like cross-validation or 
> other inferential steps based on how well the model generalizes to / 
> predicts new data instead of significance testing of coefficients or 
> linear hypotheses.
> John Kruschke has written about this issue at some length and seems 
> convinced that it's (almost) always a bad idea to bend the 
> metric/continuous assumption when dealing with ordinal data:
> http://doingbayesiandataanalysis.blogspot.com/2017/12/which-movie-is-rated-better-dont-treat.html
> http://doingbayesiandataanalysis.blogspot.com/2018/09/analyzing-ordinal-data-with-metric.html
> The latter is largely a link/"press release" for the associated paper:
> Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with
> metric models: What could possibly go wrong? Journal of Experimental
> Social Psychology, 79, 328–348. doi:10.1016/j.jesp.2018.08.009
> Finally, have you tried other link and threshold functions in clmm?
> Those can make a huge difference!
> Phillip
> On 5/3/19 11:00 am, Nicolas Deguines wrote:
> > Hello everyone,
> >
> > I am investigating how engagement in a citizen science program can
> > change participants' behavior in terms of implementing gardening
> > techniques benefitting biodiversity.
> > There are 2362 participants, distributed into 7 cohorts (= year in
> > which they joined the program), and I have repeated gardening
> > technique information for multiple years for each participant.
> > So I need to use mixed modeling.
> >
> > One of the response variables is a score that can take 5 values: 0,
> > 1, 2, 3, or 4. It's ordered, it's not continuous (there are 5 levels).
> > I would analyze this with a cumulative link mixed model (using
> > clmm() from the ordinal package), but the Hessian condition number I
> > obtained with such a model is 5.0e+06, i.e. the assumption is
> > violated (simplifying my initial full model did not help at all).
> >
> > As an alternative, I am wondering if I could treat this response
> > variable as a continuous one in an lmer() model.
> > When I do:
> > - Normality of model residuals is nicely met
> > - Homoscedasticity of model residuals is met as well.
> > => is meeting these two assumptions enough to validate the use of an
> > lmer() model for an ordered categorical response variable?
> >
> > In one of Douglas Bates' presentations (slide 3 of Jan. 2011, Madison:
> > http://lme4.r-forge.r-project.org/slides/2011-01-11-Madison/5GLMM.pdf),
> > it is said that
> > "When using LMMs we assume that the response being modeled is on a
> > continuous scale.
> > Sometimes we can bend this assumption a bit if the response is an
> > ordinal response with a moderate to large number of levels.
> > For example, [...a response variable taking] integer values on the
> > scale of 1 to 10."
> > => is 5 levels too few to be treated as continuous? Or would it be ok
> > given that the residuals behave nicely?
> >
> > I would appreciate any help and thoughts on this.
> > I checked that this was not treated in a previous post and I hope I
> > did not miss it (sorry if I did).
> >
> > Best,
> > Nicolas Deguines
> > ----------------------------------
> > Postdoctoral Research Associate
> > Laboratoire Ecologie, Systématique et Evolution Université Paris 
> > Sud, Orsay, France
> > Website: http://nicolasdeguines.weebly.com/
> >
> >       [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-mixed-models using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> >
