[R] Behavior of ordered factors in glm

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Jan 6 08:09:05 CET 2008


Further to Duncan's comments, you can control factor codings via 
options(contrasts=), by setting contrasts() on the factor and via C().
This does enable you to code an ordered factor as a linear term, for 
example.

The only place I know that this is discussed in any detail is in Bill 
Venables' account in MASS chapter 6.

On Sat, 5 Jan 2008, Duncan Murdoch wrote:

> On 05/01/2008 7:16 PM, David Winsemius wrote:
>> David Winsemius <dwinsemius at comcast.net> wrote in
>> news:Xns9A1CC05755274dNOTwinscomcast at 80.91.229.13:
>>
>>> I have a variable which is roughly age categories in decades. In the
>>> original data, it came in coded:
>>>> str(xxx)
>>> 'data.frame':   58271 obs. of  29 variables:
>>>  $ issuecat   : Factor w/ 5 levels "0 - 39","40 - 49",..: 1 1  1
>>>  1...
>>> snip
>>>
>>> I then defined issuecat as ordered:
>>>> xxx$issuecat<-as.ordered(xxx$issuecat)
>>> When I include issuecat in a glm model, the result makes me think I
>>> have asked R for a linear+quadratic+cubic+quartic polynomial fit.
>>> The results are not terribly surprising under that interpretation,
>>> but I was hoping for only a linear term (which I was taught to
>>> call a "test of trend"), at least as a starting point.
>>>
>>>> age.mdl<-glm(actual~issuecat,data=xxx,family="poisson")
>>>> summary(age.mdl)
>>> Call:
>>> glm(formula = actual ~ issuecat, family = "poisson", data = xxx)
>>>
>>> Deviance Residuals:
>>>     Min       1Q   Median       3Q      Max
>>> -0.3190  -0.2262  -0.1649  -0.1221   5.4776
>>>
>>> Coefficients:
>>>             Estimate Std. Error z value Pr(>|z|)
>>> (Intercept) -4.31321    0.04865 -88.665   <2e-16 ***
>>> issuecat.L   2.12717    0.13328  15.960   <2e-16 ***
>>> issuecat.Q  -0.06568    0.11842  -0.555    0.579
>>> issuecat.C   0.08838    0.09737   0.908    0.364
>>> issuecat^4  -0.02701    0.07786  -0.347    0.729
>>>
>>> This also means my advice to a another poster this morning may have
>>> been misleading. I have tried puzzling out what I don't understand
>>> by looking at indices or searching in MASSv2, the Blue Book,
>>> Thompson's application of R to Agresti's text, and the FAQ, so far
>>> without success. What I would like to achieve is having the lowest
>>> age category be a reference category (with the intercept being the
>>> log-rate) and each succeeding age category  be incremented by 1. The
>>> linear estimate would be the log(risk-ratio) for increasing ages. I
>>> don't want the higher order polynomial estimates. Am I hoping for
>>> too much?
>>>
>>
>> I acheived what I needed by:
>>
>>> xxx$agecat<-as.numeric(xxx$issuecat)
>>> xxx$agecat<-xxx$agecat-1
>>
>> The results look quite sensible:
>>> exp.mdl<-glm(actual~gendercat+agecat+smokecat, data=xxx,
>> family="poisson", offset=expected)
>>> summary(exp.mdl)
>>
>> Call:
>> glm(formula = actual ~ gendercat + agecat + smokecat, family =
>> "poisson",
>>     data = xxx, offset = expected)
>>
>> Deviance Residuals:
>>     Min       1Q   Median       3Q      Max
>> -0.5596  -0.2327  -0.1671  -0.1199   5.2386
>>
>> Coefficients:
>>                 Estimate Std. Error z value Pr(>|z|)
>> (Intercept)     -5.89410    0.11009 -53.539  < 2e-16 ***
>> gendercatMale    0.29660    0.06426   4.615 3.92e-06 ***
>> agecat           0.66143    0.02958  22.360  < 2e-16 ***
>> smokecatSmoker   0.22178    0.07870   2.818  0.00483 **
>> smokecatUnknown  0.02378    0.08607   0.276  0.78233
>>
>> I remain curious about how to correctly control ordered factors, or I
>> should just simply avoid them.
>
> If you're using a factor, R generally assumes you mean each level is a
> different category, so you get levels-1 parameters.  If you don't want
> this, you shouldn't use a factor:  convert to a numeric scale, just as
> you did.
>
> Duncan Murdoch
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595




More information about the R-help mailing list