[R-sig-ME] trouble specifying cumulative-link mixed model

Paul Johnson pauljohn32 at gmail.com
Mon Jun 29 20:33:55 CEST 2015



On Sat, Jun 27, 2015 at 6:57 PM, Dan McCloy <drmccloy at uw.edu> wrote:
> I'm having a little trouble figuring out how to set up my model. The data
> are about pronunciation of vowel sounds in speech: 21556 observations of an
> ordered categorical outcome ("none", "devoiced", "deleted"). I *think* what
> I want is a cumulative-link mixed model, which I can get with the ordinal
> package's clmm() function. However, the outcome is very unbalanced (which
> may or may not be the source of my problems):
>
> table(cleandata$reduction)
> ##     none devoiced  deleted
> ##    20776      360      420
>

Your outcome is so imbalanced that you may fall into the "rare events"
end of the spectrum.  We know logit-type models are biased when events
are rare, but there are pretty good tools for correcting the
coefficients.  If you Google "rare events logistic" you will find a
tempest-in-a-teapot sort of debate.  This note by Paul Allison rings
true:

http://statisticalhorizons.com/logistic-regression-for-rare-events

The caution, however, is that with data like these it is almost
impossible to get a predicted value that is anything except "none".
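To see that for yourself, here is a quick sketch with made-up data.
The effect size and thresholds are my invention, chosen to mimic your
roughly 96/2/2 percent split, and I use MASS::polr() as a single-level
stand-in for clmm():

```r
## Hypothetical illustration: with ~96% "none", even a genuine covariate
## effect almost never moves the modal predicted category off "none".
library(MASS)  # polr(): cumulative-link model without random effects

set.seed(1)
n <- 5000
x <- rnorm(n)
eta <- 0.5 * x                            # a real, nontrivial effect
p_none     <- plogis(3.3 - eta)           # P(y <= "none"), ~0.96 at x = 0
p_devoiced <- plogis(4.0 - eta) - p_none  # P(y == "devoiced")
u <- runif(n)
y <- factor(ifelse(u < p_none, "none",
            ifelse(u < p_none + p_devoiced, "devoiced", "deleted")),
            levels = c("none", "devoiced", "deleted"), ordered = TRUE)

fit <- polr(y ~ x, method = "logistic")
table(predict(fit))  # the modal prediction is "none" for every case
```

The fitted coefficient on x is recovered fine; it is the predicted
*category* that is stuck at "none", because x would have to be
enormous before P(none) drops below 0.5.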

> I have random effects for nuisance variables speaker (n=6) and word
> (n=1204). Observations within these two are fairly well-balanced (i.e.,
> there is not much missing data; the vast majority of speaker-word pairings
> have exactly 3 observations):
>
> table(with(cleandata, table(word, speaker)))
> ##    0    1    2    3
> ##   14   11   52 7147
>

On the other hand, there is no happy answer for a grossly imbalanced
predictor, if that is in fact a predictor.  It is virtually impossible
to show that being in category 3 has a significant effect compared to
category 2 when there are 7147 cases in one and 52 in the other.

You should write a small simulation that generates ordinal data with
various predictors.  See if clmm() gives you answers you believe and
expect.  Then push the simulation toward the edges of unanimity and
see what clmm() gives you.  I can say, almost with certainty, that you
will see the sampling distribution of the estimated coefficients get
wider and wider as the imbalance grows, and if you then do that thing
people always do, keeping only the coefficients with "good" p-values,
the ones you keep will be badly biased.
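A minimal version of that simulation might look like this.  The effect
size, thresholds, and random-effect SD are all mine, purely for
illustration; raise the thresholds to push the outcome toward "none"
and watch the standard error on x grow:

```r
## Simulate speaker-grouped ordinal data from a cumulative-logit model
## with a random intercept per speaker (all parameter values hypothetical).
sim_ordinal <- function(n_speaker = 6, n_obs = 300, beta = 1,
                        thresholds = c(0, 2), sd_speaker = 0.5) {
  speaker <- factor(rep(seq_len(n_speaker), each = n_obs))
  u <- rnorm(n_speaker, sd = sd_speaker)[speaker]  # random intercepts
  x <- rbinom(length(speaker), 1, 0.5)             # a balanced predictor
  eta <- beta * x + u
  p1 <- plogis(thresholds[1] - eta)                # P(y <= 1)
  p2 <- plogis(thresholds[2] - eta)                # P(y <= 2)
  r <- runif(length(speaker))
  y <- ordered(ifelse(r < p1, 1, ifelse(r < p2, 2, 3)))
  data.frame(y, x, speaker)
}

set.seed(42)
d <- sim_ordinal()
table(d$y)
## Then refit and compare against the true beta = 1, e.g.:
## library(ordinal)
## fit <- clmm(y ~ x + (1 | speaker), data = d)
## summary(fit)
```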


> What I really care about is preceding consonant type (6 levels), following
> consonant type (also 6 levels), and whether following consonant is itself
> followed by another consonant ("coda", binary).  Another predictor we expect
> to be important is whether or not preceding and/or following consonant are
> "aspirated", which is a property of only 2 of the 6 levels. The identity of
> the vowel itself (10 levels) is not of primary interest but definitely
> ought to be included; I am open to including it as a random effect, but
> slightly prefer being able to see estimates for each level of the vowel if
> possible.
>
> What I'm struggling with is how to specify the fixed effects. If I include
> everything (precedingCons + followingCons * coda + aspirated + random
> effects), I get "numerically singular Hessian" problems, regardless of
> whether I specify preceding / following consonant as factors, or set up
> binary variables like "preceding.stop", "following.stop",
> "preceding.fricative", etc. (which I think are equivalent anyway, since the
> factor was treatment-coded).  I can get the model to converge if I do the
> binary variables method but only include 3 of the 6 levels for preceding
> and following consonant (plus "aspirated" & random effects).
>
This means your predictors are redundant: not necessarily
theoretically, but in the data.  Use model.matrix() to look at your
predictors as R builds the design matrix.  I guarantee some columns
will be identical or very similar.

You can manufacture new categories of predictors to simplify.  Here is
an example of the kind of problem you likely have:

1 1 1 1
1 0 0 0
1 0 0 0
1 0 0 0
1 0 1 1
1 0 1 1
1 0 1 1


We never see variable 1 take any value but 1.  We never see a 1 on
variable 3 without also a 1 on variable 4.  The only predictor that is
separately useful is the 2nd column.
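You can check this directly in R: build the design matrix and compare
its number of columns to its numerical rank (variable names here are
hypothetical, filled in with the little example matrix above):

```r
## Redundancy check: if qr(X)$rank < ncol(X), the design matrix is
## singular and the model cannot identify all the coefficients.
X <- model.matrix(~ v1 + v2 + v3 + v4,
                  data = data.frame(v1 = c(1, 1, 1, 1, 1, 1, 1),
                                    v2 = c(1, 0, 0, 0, 0, 0, 0),
                                    v3 = c(1, 0, 0, 0, 1, 1, 1),
                                    v4 = c(1, 0, 0, 0, 1, 1, 1)))
ncol(X)     # 5 columns: intercept + 4 predictors
qr(X)$rank  # 3: v1 duplicates the intercept and v3 == v4
```

A rank deficit like this is exactly what produces the "numerically
singular Hessian" complaint.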


> My questions:
> 1. is CLMM the right modeling choice?  In principle I can collapse
> "reduction" to binary and do a logit-link glmm, by collapsing "devoiced"
> and "deleted" into one category, but really don't want to have to resort to
> that.
> 2. is the imbalance in my outcome causing the problems with the modeling?
> Do I need some sort of zero-inflation model (which I've heard talked about
> on this list, but don't really understand yet)?
> 3. any suggestions for how to specify the fixed effects (i.e., factor
> coding)?
>
> Some additional tables showing distribution of the response levels are
> included in a GitHub gist here:
> https://gist.github.com/drammock/bf7a6d634bbd179b328f
>
> thanks,
> -- dan
>
> Daniel McCloy
> http://dan.mccloy.info/
> Postdoctoral Research Fellow
> Institute for Learning and Brain Sciences
> University of Washington
>
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models



-- 
Paul E. Johnson
Professor, Political Science        Director
1541 Lilac Lane, Room 504      Center for Research Methods
University of Kansas                 University of Kansas
http://pj.freefaculty.org              http://crmda.ku.edu


