[R] Behavior of ordered factors in glm
David Winsemius
dwinsemius at comcast.net
Sun Jan 6 01:16:47 CET 2008
David Winsemius <dwinsemius at comcast.net> wrote in
news:Xns9A1CC05755274dNOTwinscomcast at 80.91.229.13:
> I have a variable which is roughly age categories in decades. In the
> original data, it came in coded:
>> str(xxx)
> 'data.frame': 58271 obs. of 29 variables:
> $ issuecat : Factor w/ 5 levels "0 - 39","40 - 49",..: 1 1 1
> 1...
> snip
>
> I then defined issuecat as ordered:
>> xxx$issuecat<-as.ordered(xxx$issuecat)
>
> When I include issuecat in a glm model, the result makes me think I
> have asked R for a linear+quadratic+cubic+quartic polynomial fit.
> The results are not terribly surprising under that interpretation,
> but I was hoping for only a linear term (which I was taught to
> call a "test of trend"), at least as a starting point.
>
>> age.mdl<-glm(actual~issuecat,data=xxx,family="poisson")
>> summary(age.mdl)
>
> Call:
> glm(formula = actual ~ issuecat, family = "poisson", data = xxx)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -0.3190 -0.2262 -0.1649 -0.1221 5.4776
>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -4.31321 0.04865 -88.665 <2e-16 ***
> issuecat.L 2.12717 0.13328 15.960 <2e-16 ***
> issuecat.Q -0.06568 0.11842 -0.555 0.579
> issuecat.C 0.08838 0.09737 0.908 0.364
> issuecat^4 -0.02701 0.07786 -0.347 0.729
>
> This also means my advice to a another poster this morning may have
> been misleading. I have tried puzzling out what I don't understand
> by looking at indices or searching in MASSv2, the Blue Book,
> Thompson's application of R to Agresti's text, and the FAQ, so far
> without success. What I would like to achieve is having the lowest
> age category be a reference category (with the intercept being the
> log-rate) and each succeeding age category be incremented by 1. The
> linear estimate would be the log(risk-ratio) for increasing ages. I
> don't want the higher order polynomial estimates. Am I hoping for
> too much?
>
I acheived what I needed by:
> xxx$agecat<-as.numeric(xxx$issuecat)
> xxx$agecat<-xxx$agecat-1
The results look quite sensible:
> exp.mdl<-glm(actual~gendercat+agecat+smokecat, data=xxx,
family="poisson", offset=expected)
> summary(exp.mdl)
Call:
glm(formula = actual ~ gendercat + agecat + smokecat, family =
"poisson",
data = xxx, offset = expected)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5596 -0.2327 -0.1671 -0.1199 5.2386
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.89410 0.11009 -53.539 < 2e-16 ***
gendercatMale 0.29660 0.06426 4.615 3.92e-06 ***
agecat 0.66143 0.02958 22.360 < 2e-16 ***
smokecatSmoker 0.22178 0.07870 2.818 0.00483 **
smokecatUnknown 0.02378 0.08607 0.276 0.78233
I remain curious about how to correctly control ordered factors, or I
should just simply avoid them.
--
David Winsemius
More information about the R-help
mailing list