[R-sig-ME] Mixed-model polytomous ordered logistic regression ?

Rune Haubo rune.haubo at gmail.com
Tue Jan 4 10:45:46 CET 2011

On 3 January 2011 22:53, Emmanuel Charpentier <emm.charpentier at free.fr> wrote:
> On Monday 3 January 2011 at 19:54 +0100, Rune Haubo wrote:
>> Hi Emmanuel,
>>
>> You can use the clmm() function in package ordinal to fit cumulative
>> link mixed models which include the proportional odds mixed model that
>> you mention. The non-mixed equivalent, clm() gives the same answers as
>> polr() and glm() on your example data.
>>
>> clmm() currently only allows for one random intercept, but Ken
>> Knoblauch proposed a polmer() function
>> (http://finzi.psych.upenn.edu/R-sig-mixed-models/2010q2/003778.html)
that seems to be an extension of glmer() similar to your glm() hack,
>> but for mixed models.
>
> That was my point, and Ken Knoblauch has been kind enough to send me
> his code, which seems quite good. I have not yet worked on his code (and
> probably won't for the next month at least), but it's probably a good
> start.
>
> The "800 degrees of freedom" problem still bother me. But that's another
> problem (see below).

Aren't the 800 df in the null model just an artifact of the fitting
algorithm? glm() counts the null df from the number of rows in the
design matrix, AFAIR, which corresponds to the 'Dataset' data.frame
you copied and stacked to accommodate fitting with glm(). What would
you use or interpret these (100 or 800) df for? Some people want to use
the residual deviance and corresponding residual df for testing
goodness of fit, which can make sense if observations are binomial
with large enough expected frequencies. In other models like CLMs we
have to look at the multinomial table of observed outcomes and
corresponding table of fitted frequencies, and then compare these with
deviance or Pearson GOF statistics directly. This involves counting
the no. degrees of freedom from the multinomial table adjusting for
row sum constraints.

In your example data each multinomial (vector) observation consists of
a single 1 and eight zeros, and the situation is similar to a binomial
glm with binary observations; the expected frequencies are not large
enough to give meaningful GOF statistics.
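The glm() convention is easy to see on made-up data. A sketch with simulated binary data (not your data set):

```r
## glm() reports the null df as nrow(data) - 1: one df per row of the
## design matrix, minus one for the intercept (simulated data)
set.seed(1)
d <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
fit <- glm(y ~ x, family = binomial, data = d)
fit$df.null      # 99 = nrow(d) - 1
fit$df.residual  # 98 = nrow(d) - 2 (intercept and slope)
```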

>
>> This allows for several random effects in a
>> glmer() fashion. On the other hand, with clmm() you get a variety of
>> link functions and extensions to scale and nominal effects for
>> modelling non-proportional effects, structured thresholds etc.
>
> Thank you for the hint. I'll have a look at that. I have trouble
> remembering which of the 2500+ R packages available on CRAN will solve
> my problem-of-the-moment. Furthermore, I tend to trust a package more
> when it does something I've personally tinkered with. Hence my current
> infatuation with BUGS (i.e. JAGS for the time being)...
>
> But my point wasn't to solve a specific problem, but to point out that
> ordinal data are a sufficiently frequent case to deserve "first-class
> treatment". I am old enough to have been teached what was then called
> "the general linear model", which was about the only case whee
> multivariate model could be realistically fitted. Things tended to exa,d
> a bit with censored data and the introduction of the Cox model, then to
> Poisson and logistic regression. I tend to think af ordered data as
> another kind of data that should be analysable the same way. That's what
> polr() partially does (with a limited but wisely choosen set of
> options).
>
> Similarly, mixed models became *practically* fittable first for the
> dreaded Gaussian case, then for Poisson, binomial and negative-binomial
> cases. Ordered data should be analysable the same way.

I couldn't agree more with these points. Agresti mentions in several
of his reviews of regression models for (mixed) ordinal data that he
finds them underused. This is probably due to lack of easy-to-use
software and lack of familiarity with the model class - as you say,
regression models for ordinal data are often not mentioned until after
generalized linear models and survival models, and many applied
researchers don't get this far.

I would be happy to work on an implementation of cumulative link mixed
models for more general random effects structures and put it in the
ordinal package. Any collaboration on this welcome.

>
> Another case (not directly in my current interests, but one never
> knows...) is categorical data. Log-linear models allow for a similar
> treatment, and should also be analysable in the mixed-models case.
>
> A lot is to be learned from the Bayesian models for this kind of
> problems. I'm currently exploring the (now somewhat outdated) Congdon
> textbooks on Bayesian modelling, and they raise interesting ideas about
> what should be doable from a frequentist point of view...
>
> On the other hand, I'm more and more inclined to think that the Bayesian
> point of view is extremely valid. The first difficulty is to convince
> journals' editors that "p-value is crack" and that a posterior joint
> distribution says all that can be learned from an
> experiment/observation, provided the priors are correctly chosen. The
> second (and harder) is to find ways to express Bayesian probability
> models in a more concise way than what is currently allowed by BUGS:
> one should not have to type dozens of almost-identical snippets of code
> for each and every intermediate parameter appearing in such a model. My
> current idea is that some sort of macro facility would help, but it is
> hard to design *correctly*. The third (and hardest) difficulty is to
> implement that efficiently. I'm a bit struck to see all "public" BUGS
> implementations using only one CPU of a 4- or 8-CPU machine, even for
> parts (model updating of parallel chains) that are "embarrassingly
> parallel" ((C) D. Spiegelhalter et al.).
>
> In our present case, the Bayesian answer "feels" more reasonable than
> the answer given by polr/glm, in particular because the variability
> estimation of the thresholds seems to avoid the artefact induced by a
> non-centered independent variable...

The polr/clm/glm solution is a pure likelihood solution, and the
standard errors are simply obtained from the Hessian of the likelihood
function wrt. the parameters. In the sense that a Bayesian solution is
prior + likelihood, the difference between the Bayesian and likelihood
solutions is caused by the information in the chosen prior. It seems
to me that the Bayesian solution depends on the centering of the
independent variable, X - is that right? Would you get approximately
the same results from the likelihood and Bayesian approaches if your
independent variable, X, was appropriately centered?
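If centering is indeed the culprit, that is easy to check on the likelihood side. A sketch (assuming Data, Cat and X as in your example):

```r
library(ordinal)  # for clm()

## refit with X mean-centred; the slope and its SE should be unchanged,
## while the threshold SEs should shrink if centering was the artefact
fm  <- clm(Cat ~ X, data = Data)
fmc <- clm(Cat ~ scale(X, scale = FALSE), data = Data)
coef(summary(fm))   # thresholds and slope with standard errors
coef(summary(fmc))  # compare the threshold standard errors
```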

I agree that Bayesian thinking has a lot to offer as a contrast to the
p-value obsession of frequentist thinking, but on the other hand I am
not particularly fond of contaminating "what the data say" with an
unmeasurable amount of prior belief - at least that is not relevant in
the analyses I conduct. I therefore prefer to stay with a non-prior
Bayesian solution, i.e. the likelihood function.

On the other hand, model identification is often important in my
analyses, and this is why the summary method for clm and clmm objects
returns the condition number of the Hessian. This is, to the best of my
knowledge, not available in a Bayesian analysis. If this number tends
to infinity, some parameters are redundant, while if it is large but
finite, the model is close to unidentifiable, or 'empirically
unidentifiable'. For your example data the condition number is

> summary(fm1 <- clm(Cat ~ X, data = Data))
....
Condition number of Hessian: 152903.78

which is relatively large. Looking at the threshold estimates they are
almost equidistant, so we could try to simplify the model by requiring
that the thresholds are equidistant or symmetric rather than just free
to vary:

fm2 <- clm(Cat ~ X, data = Data, threshold = "symmetric")
fm3 <- clm(Cat ~ X, data = Data, threshold = "equidistant")
> anova(fm1,fm2, fm3)
Likelihood ratio tests of cumulative link models

Response: Cat
Model Resid. df -2logLik   Test    Df  LR stat.   Pr(Chi)
1 X |  |         97 56.37787
2 X |  |         94 55.95107 1 vs 2     3 0.4268071 0.9346508
3 X |  |         91 55.73970 2 vs 3     3 0.2113667 0.9757338

so it is possible to save 6 parameters and obtain almost the same
likelihood. The following summary shows that the first threshold is
almost the same as in the original model and that the following seven
thresholds are separated by 9.37 units. Inference for the X covariate
changes only a little. The condition number of the Hessian is much
better and the model gives a more succinct description of the data.
The point is that we are not warned of the near unidentifiability in
the Bayesian analysis.
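In case it is useful, the reported number can, I believe, be reproduced directly from the Hessian as the ratio of its extreme eigenvalues (a sketch, assuming the fitted object stores the Hessian as fm1$Hessian):

```r
## sketch: condition number = largest / smallest eigenvalue (in
## absolute value) of the Hessian of the negative log-likelihood
e <- eigen(fm1$Hessian, only.values = TRUE)$values
max(abs(e)) / min(abs(e))
```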

> summary(fm3)
Call:
clm(location = Cat ~ X, data = Data, threshold = "equidistant")

Location coefficients:
Estimate Std. Error z value Pr(>|z|)
X  9.2587   1.7012     5.4425 5.2550e-08

No scale coefficients

Threshold coefficients:
Estimate Std. Error z value
threshold.1 14.2865   2.7275     5.2380
spacing      9.3689   1.7173     5.4557

log-likelihood: -28.18894
AIC: 62.37787
Condition number of Hessian: 3801.074

If these were observations from a questionnaire, it would be possible
to infer that the respondents' perception of the underlying rating
scale is approximately linear - that the perceptual distance between
adjacent categories is approximately the same. This is sometimes
stated as an (untested) assumption to justify the use of linear models
for ordinal ratings data, and indeed a linear model gives the same
qualitative results for these data, where the assumption is fulfilled:
> summary(lm(as.numeric(Cat) ~ X, data = Data))
....
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.14022    0.06417   2.185   0.0313 *
X            0.95977    0.01251  76.717   <2e-16 ***
---
where the reference scale differs by roughly a factor of 10
(9.2587/0.95977 ≈ 9.6), close to the estimated threshold spacing.

>
>> The idea of reformulating the cumulative link model as a binomial
>> generalized linear model is good, but I haven't seen it described
>> before - can I ask you which exercise in what edition of Gelman & Hill
>> (2007) that mentions this?
>
> Ex 15.1, section 15.5 of Gelman & Hill (2007), p. 342. The printer's
> info at the beginning of the book (just after the full title page, but I
> don't know what this page is called in American usage) says it's "First
> published 2007//Reprinted with corrections 2007//10th printing 2009"

Thanks!

Cheers,
Rune