[R] Behavior of ordered factors in glm
Duncan Murdoch
murdoch at stats.uwo.ca
Sun Jan 6 03:06:13 CET 2008
On 05/01/2008 7:16 PM, David Winsemius wrote:
> David Winsemius <dwinsemius at comcast.net> wrote in
> news:Xns9A1CC05755274dNOTwinscomcast at 80.91.229.13:
>
>> I have a variable which is roughly age categories in decades. In the
>> original data, it came in coded:
>>> str(xxx)
>> 'data.frame': 58271 obs. of 29 variables:
>> $ issuecat : Factor w/ 5 levels "0 - 39","40 - 49",..: 1 1 1
>> 1...
>> snip
>>
>> I then defined issuecat as ordered:
>>> xxx$issuecat<-as.ordered(xxx$issuecat)
>> When I include issuecat in a glm model, the result makes me think I
>> have asked R for a linear+quadratic+cubic+quartic polynomial fit.
>> The results are not terribly surprising under that interpretation,
>> but I was hoping for only a linear term (which I was taught to
>> call a "test of trend"), at least as a starting point.
>>
>>> age.mdl<-glm(actual~issuecat,data=xxx,family="poisson")
>>> summary(age.mdl)
>> Call:
>> glm(formula = actual ~ issuecat, family = "poisson", data = xxx)
>>
>> Deviance Residuals:
>> Min 1Q Median 3Q Max
>> -0.3190 -0.2262 -0.1649 -0.1221 5.4776
>>
>> Coefficients:
>> Estimate Std. Error z value Pr(>|z|)
>> (Intercept) -4.31321 0.04865 -88.665 <2e-16 ***
>> issuecat.L 2.12717 0.13328 15.960 <2e-16 ***
>> issuecat.Q -0.06568 0.11842 -0.555 0.579
>> issuecat.C 0.08838 0.09737 0.908 0.364
>> issuecat^4 -0.02701 0.07786 -0.347 0.729
>>
>> This also means my advice to a another poster this morning may have
>> been misleading. I have tried puzzling out what I don't understand
>> by looking at indices or searching in MASSv2, the Blue Book,
>> Thompson's application of R to Agresti's text, and the FAQ, so far
>> without success. What I would like to achieve is having the lowest
>> age category be a reference category (with the intercept being the
>> log-rate) and each succeeding age category be incremented by 1. The
>> linear estimate would be the log(risk-ratio) for increasing ages. I
>> don't want the higher order polynomial estimates. Am I hoping for
>> too much?
>>
>
> I acheived what I needed by:
>
>> xxx$agecat<-as.numeric(xxx$issuecat)
>> xxx$agecat<-xxx$agecat-1
>
> The results look quite sensible:
>> exp.mdl<-glm(actual~gendercat+agecat+smokecat, data=xxx,
> family="poisson", offset=expected)
>> summary(exp.mdl)
>
> Call:
> glm(formula = actual ~ gendercat + agecat + smokecat, family =
> "poisson",
> data = xxx, offset = expected)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -0.5596 -0.2327 -0.1671 -0.1199 5.2386
>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -5.89410 0.11009 -53.539 < 2e-16 ***
> gendercatMale 0.29660 0.06426 4.615 3.92e-06 ***
> agecat 0.66143 0.02958 22.360 < 2e-16 ***
> smokecatSmoker 0.22178 0.07870 2.818 0.00483 **
> smokecatUnknown 0.02378 0.08607 0.276 0.78233
>
> I remain curious about how to correctly control ordered factors, or I
> should just simply avoid them.
If you're using a factor, R generally assumes you mean each level is a
different category, so you get levels-1 parameters. If you don't want
this, you shouldn't use a factor: convert to a numeric scale, just as
you did.
Duncan Murdoch
More information about the R-help
mailing list