[R-sig-ME] Is it ok to use lmer() for an ordered categorical (5 levels) response variable?

Pierce, Steven pierces1 at msu.edu
Wed Mar 6 15:37:54 CET 2019


So your score is constructed by summing 4 binary variables, each of which represents a feature of a natural environment. This is similar to how social science researchers use survey data with binary questions that get combined into scale scores. Such scale scores are usually intended to measure latent variables. 

By taking a simple sum of the binary items as the scale score, you are making the strong assumption that each item is equally important for measuring the latent variable. That's a testable hypothesis. Consider doing a confirmatory factor analysis (CFA) using the 4 items as indicators of a latent variable (naturalness) that is assumed to be continuous and normally distributed. The point of this is to get a decent measurement model for the latent variable. 

Allow the factor loadings to vary across items (some may be more important to measuring the latent variable than others); also test whether there are correlations between the item residuals that are not explained by the items being indicators of the latent variable. If you can achieve adequate CFA model fit and composite reliability for the latent variable (i.e., naturalness), then you can save factor score estimates of the latent variable, which should be continuous, normally distributed values. 

Then you can either take those factor score estimates and plug them into your mixed model, or just expand from a confirmatory factor analysis model into a full structural equation model that includes all the elements of the desired mixed model, with the latent "naturalness" variable as the outcome. The latter approach is more theoretically desirable because it reduces bias due to measurement error. 
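
As a rough illustration of the first route (CFA, then factor scores), here is a minimal sketch using the lavaan package. Everything in it is an assumption made for the example: the data are simulated, the 0/1 column names (fallow, nettles, ivy, brambles) are hypothetical, and the loadings and thresholds are made up. It is not a substitute for checking model fit and reliability on the real data.

```r
library(lavaan)

set.seed(1)
n <- 1000
eta <- rnorm(n)  # simulated latent "naturalness" (illustrative only)

# Hypothetical 0/1 items with unequal loadings; all values are made up
make_item <- function(loading, threshold) {
  as.integer(loading * eta + sqrt(1 - loading^2) * rnorm(n) > threshold)
}
items <- data.frame(
  fallow   = make_item(0.8,  0.0),
  nettles  = make_item(0.7,  0.3),
  ivy      = make_item(0.6, -0.2),
  brambles = make_item(0.5,  0.4)
)

# One-factor CFA; the loadings are free to differ across items,
# and the four items are declared as ordered (binary) indicators
model <- "naturalness =~ fallow + nettles + ivy + brambles"
fit <- cfa(model, data = items, ordered = names(items), std.lv = TRUE)

summary(fit, fit.measures = TRUE, standardized = TRUE)

# Factor score estimates, one per row of the data
scores <- lavPredict(fit)
items$naturalness_fs <- as.numeric(scores[, "naturalness"])
```

The naturalness_fs column could then be used as the outcome in lmer(). Note that four binary items allow at most 16 distinct response patterns, so the factor score estimates can take at most 16 distinct values; it is worth looking at their distribution before relying on them.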

Steven J. Pierce, Ph.D.
Acting Director; Associate Director
Center for Statistical Training & Consulting (CSTAT)
Michigan State University
E-mail: pierces1 using msu

-----Original Message-----
From: Nicolas Deguines <nicodeguines using gmail.com> 
Sent: Wednesday, March 6, 2019 9:01 AM
To: r-sig-mixed-models using r-project.org
Subject: Re: [R-sig-ME] Is it ok to use lmer() for an ordered categorical (5 levels) response variable?

Hello Phillip and all,

Thanks a lot Phillip for your very interesting and useful answer, and for
the paper from Liddell & Kruschke. It helps a lot.

About trying other link and threshold functions in clmm: no huge difference
in my case unfortunately. I tried different combinations of each.
'equidistant' did do better, but the improvement was far from enough.
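
For readers who want to try the same thing, a comparison of this kind can be sketched as below. This is not my actual analysis: the data are simulated, the variable names (id, year, score) and all parameter values are assumptions for the example, and only a few link/threshold combinations are shown.

```r
library(ordinal)

set.seed(2)
n_id <- 150
n_year <- 4
d <- expand.grid(id = factor(seq_len(n_id)), year = 0:(n_year - 1))
u <- rnorm(n_id)  # simulated participant-level random intercepts
latent <- 0.3 * d$year + u[as.integer(d$id)] + rnorm(nrow(d))

# Hypothetical 0-4 ordinal score obtained by cutting a latent variable
d$score <- cut(latent, breaks = c(-Inf, -1, 0, 1, 2, Inf),
               labels = as.character(0:4), ordered_result = TRUE)

# Same fixed and random effects, different link / threshold structures
fit_flex   <- clmm(score ~ year + (1 | id), data = d,
                   link = "logit", threshold = "flexible")
fit_equi   <- clmm(score ~ year + (1 | id), data = d,
                   link = "logit", threshold = "equidistant")
fit_probit <- clmm(score ~ year + (1 | id), data = d,
                   link = "probit", threshold = "flexible")

aics <- sapply(list(logit_flexible    = fit_flex,
                    logit_equidistant = fit_equi,
                    probit_flexible   = fit_probit), AIC)
aics
```

summary() on each fit prints the Hessian condition number (cond.H) next to the log-likelihood and AIC, so the fits can be compared on that as well.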

I computed density plots for my response variable as observed and as
predicted from my lmer() model (similar to what Liddell and Kruschke do in
Figure 6): the linear mixed model does pretty well in fitting the data.
=> so I'd be inclined to trust the results from my lmer models in the
present case (but Liddell and Kruschke did show very clear cases where a
linear model fit the ordinal data very poorly).
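
One way to run this kind of check is sketched below, on simulated data rather than my own. The variable names (id, year, score) and parameter values are assumptions for the example. Rounding the simulated responses and clipping them to 0-4 is one possible way to put them on the observed scale; it is not necessarily what Liddell and Kruschke do.

```r
library(lme4)

set.seed(3)
n_id <- 200
n_year <- 4
d <- expand.grid(id = factor(seq_len(n_id)), year = 0:(n_year - 1))
u <- rnorm(n_id, sd = 0.8)  # simulated participant-level intercepts
latent <- 1.5 + 0.3 * d$year + u[as.integer(d$id)] + rnorm(nrow(d))
d$score <- pmin(pmax(round(latent), 0), 4)  # hypothetical 0-4 score

fit <- lmer(score ~ year + (1 | id), data = d)

# Simulate new responses from the fitted model (one column per draw)
sims <- simulate(fit, nsim = 200)

# Put any response vector on the 0-4 scale: round, then clip
bin <- function(y) factor(pmin(pmax(round(y), 0), 4), levels = 0:4)

obs_prop <- prop.table(table(bin(d$score)))
sim_prop <- rowMeans(sapply(sims, function(y) {
  as.vector(prop.table(table(bin(y))))
}))

# Share of observations at each score: observed vs. model-simulated
comparison <- rbind(observed = as.vector(obs_prop),
                    simulated = as.vector(sim_prop))
colnames(comparison) <- 0:4
round(comparison, 3)
```

Large gaps between the two rows would point to the kind of misfit Liddell and Kruschke describe. The same simulated draws can also be passed to density() to get overlaid density plots.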

Meanwhile, I thought of another alternative for analyzing this response
variable and I would be curious to read what people may think about it.
Before presenting that alternative, I need to say more about that 5-level
response variable.
It is a score built by Muratet and Fontaine (2015)* to assess the
naturalness of a given private backyard (it is shown to be correlated with
higher abundance of butterflies).
In the backyard: fallow area, nettles (*Urtica dioica*), ivy (*Hedera helix*),
and brambles (*Rubus spp.*) are each scored one if present, and the
naturalness index was computed as the sum of these scores.
=> it results in a 5-level ordinal variable because it can go from 0 to 4,
and each increase of 1 means a backyard with more features of 'naturalness'.
I wonder thus if this could be modelled using a glmer() with family =
binomial and feeding to the model two columns: cbind(sum of 1's, sum of
0's) (see R documentation for family{stats}, in the Details: "*As a
two-column integer matrix: the first column gives the number of successes
and the second the number of failures.*")
I will try and see how the model fits the data. But I would be interested in
getting a theoretical opinion.
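
To make the idea concrete, here is a minimal sketch of what I have in mind, on simulated data. The variable names (id, year, score) and all parameter values are assumptions for the example; score stands for the number of features present (0 to 4), so 4 - score is the number absent.

```r
library(lme4)

set.seed(4)
n_id <- 200
n_year <- 4
d <- expand.grid(id = factor(seq_len(n_id)), year = 0:(n_year - 1))
u <- rnorm(n_id, sd = 0.7)  # simulated participant-level intercepts
p <- plogis(-0.5 + 0.25 * d$year + u[as.integer(d$id)])

# Hypothetical score: number of features present out of 4
d$score <- rbinom(nrow(d), size = 4, prob = p)

# Two-column response: (number of 1's, number of 0's)
fit <- glmer(cbind(score, 4 - score) ~ year + (1 | id),
             data = d, family = binomial)
summary(fit)
```

Written this way, the model treats the four features as four independent trials that share the same probability of presence within a given backyard and year. That is a strong assumption, and it is part of why I would like a theoretical opinion.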

I hope this can help others too

Best regards,
Nicolas Deguines


Postdoctoral Research Associate
Laboratoire Ecologie, Systématique et Evolution
Université Paris Sud, Orsay, France
Website: http://nicolasdeguines.weebly.com/

On Tue, 5 Mar 2019 at 13:04, Phillip Alday <phillip.alday using mpi.nl> wrote:

> Hi Nicolas,
> How much you can get away with bending the assumptions depends in some ways
> on how well the resulting model fits your data. If the resulting model
> is a poor fit, then it's not a great model for performing inference. The
> other problem with bending assumptions is that a lot of 'error
> statistics' (standard errors, t-values, and basically anything related
> to significance testing) aren't guaranteed to do what they are supposed
> to do. (In your case, the good behavior of your residuals suggests that
> this won't be a huge problem, but there are no promises.)
> You can get around this a bit by doing things like cross-validation or
> other inferential steps based on how well the model generalizes to /
> predicts new data instead of significance testing of coefficients or
> linear hypotheses.
> John Kruschke has written about this issue at some length and seems
> convinced that it's (almost) always a bad idea to bend the
> metric/continuous assumption when dealing with ordinal data:
> http://doingbayesiandataanalysis.blogspot.com/2017/12/which-movie-is-rated-better-dont-treat.html
> http://doingbayesiandataanalysis.blogspot.com/2018/09/analyzing-ordinal-data-with-metric.html
> The latter is largely a link/"press release" for the associated paper:
> Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with
> metric models: What could possibly go wrong? Journal of Experimental
> Social Psychology, 79, 328–348. doi:10.1016/j.jesp.2018.08.009
> Finally, have you tried other link and threshold functions in clmm?
> Those can make a huge difference!
> Phillip
> On 5/3/19 11:00 am, Nicolas Deguines wrote:
> > Hello everyone,
> >
> > I am investigating how engagement in a citizen science program can change
> > participants' behavior in terms of implementing gardening techniques
> > benefitting biodiversity.
> > There are 2362 participants, distributed into 7 cohorts (= year in which
> > they joined the program), and I have repeated gardening technique
> > information for multiple years for each participant.
> > So I need to use mixed modeling.
> >
> > One of the response variables is a score that can take 5 values: 0, 1, 2,
> > 3, or 4. It's ordered, it's not continuous (there are 5 levels).
> > I would analyze this with a cumulative link mixed model (using clmm() from
> > the ordinal package), but the Hessian condition number I obtained with such
> > a model is 5.0e+06, i.e. the assumption is violated (simplifying my initial
> > full model did not help at all).
> >
> > As an alternative, I am wondering if I could treat this response variable
> > as a continuous one in an lmer() model.
> > When I do:
> > - Normality of model residuals is nicely met
> > - Homoscedasticity of model residuals is met as well.
> > => is meeting these two assumptions enough to validate the use of an
> > lmer() model for an ordered categorical response variable?
> >
> > In one of Douglas Bates' presentations (slide 3 of Jan. 2011, Madison:
> > http://lme4.r-forge.r-project.org/slides/2011-01-11-Madison/5GLMM.pdf),
> > it is said that
> > "When using LMMs we assume that the response being modeled is on a
> > continuous scale.
> > Sometimes we can bend this assumption a bit if the response is an ordinal
> > response with a moderate to large number of levels.
> > For example, [...a response variable taking] integer values on the scale of
> > 1 to 10."
> > => is 5 levels too few to be treated as continuous? Or would it be ok given
> > that the residuals behave nicely?
> >
> > I would appreciate any help and thoughts on this.
> > I checked that this was not treated in a previous post and I hope I did not
> > miss it (sorry if I did).
> >
> > Best,
> > Nicolas Deguines
> > ----------------------------------
> > Postdoctoral Research Associate
> > Laboratoire Ecologie, Systématique et Evolution
> > Université Paris Sud, Orsay, France
> > Website: http://nicolasdeguines.weebly.com/