[R] logistic regression model specification
Dylan Beaudette
dylan.beaudette at gmail.com
Wed Nov 14 01:17:45 CET 2007
On Tuesday 13 November 2007, Peter Dalgaard wrote:
> Prof Brian Ripley wrote:
> > On Tue, 13 Nov 2007, Dylan Beaudette wrote:
> >> Hi,
> >>
> >> I have setup a simple logistic regression model with the glm() function,
> >> with the follow formula:
> >>
> >> y ~ a + b
> >>
> >> where:
> >> 'a' is a continuous variable stratified by
> >> the levels of 'b'
> >>
> >>
> >> Looking over the manual for model specification, it seems that
> >> coefficients for unordered factors are given 'against' the first level
> >> of that factor.
> >
> > Only for the default coding.
> >
> >> This makes for difficult interpretation when using factor 'b' as a
> >> stratifying model term.
> >
> > Really? You realize that you have not 'stratified' on 'b', which would
> > need the model to be a*b? What you have is a model with parallel linear
> > predictors for different levels of 'b', and if the coefficients are not
> > telling you what you want you should change the coding.
Thanks for the comments Peter. Note that use/abuse of jargon on my part is
augmented by my limited understanding.
> I have to differ slightly here. "Stratification", at least in the fields
> that I connect with, usually means that you combine information from
> several groups via an assumption that they have a common value of a
> parameter, which in the present case is essentially the same as assuming
> an additive model y~a+b.
This was the definition of 'stratification' that I was using when formulating
this model.
> I share your confusion as to why the parametrization of the effects of
> factor b should matter, though. Surely, the original poster has already
> noticed that the estimated effect of a is the same whether or not the
> intercept is included? The only difference I see is that the running
> anova() or drop1() would not give interesting results for the effect of
> b in the no-intercept variation.
Yes, I have noticed that the estimated effect is the same. It looks like I am
having trouble interpreting the model formula and coefficients. An example
from MASS, using the default R contrasts:
library(MASS)
data(whiteside)
# simple additive model, relative to first level of 'Insul'
# 'parallel regressions'
summary(lm(Gas ~ Temp + Insul, data=whiteside))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.55133 0.11809 55.48 <2e-16 ***
Temp -0.33670 0.01776 -18.95 <2e-16 ***
InsulAfter -1.56520 0.09705 -16.13 <2e-16 ***
# same model without the intercept
summary(lm(Gas ~ Temp + Insul -1 , data=whiteside))
Estimate Std. Error t value Pr(>|t|)
Temp -0.33670 0.01776 -18.95 <2e-16 ***
InsulBefore 6.55133 0.11809 55.48 <2e-16 ***
InsulAfter 4.98612 0.10268 48.56 <2e-16 ***
In presenting the terms of this model, I would like to be able to talk about
the physical meaning of the coefficients. Minus the intercept term, the
second example presents the slopes before and after. I now understand how to
interpret the first example, as:
Intercept - 1.56520 = InsulAfter
6.55133 - 1.56520 = 4.98612
... the meaning of the coeficient for InsulAfter is relative to that of the
Intercept (treatment style contrasts).
With the above example the significance terms do no really change when the
intercept is removed. Within the context of my data, removing the intercept
has the following effect:
# with intercept
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.296e+00 1.404e+00 5.196 2.03e-07 ***
bd_sum -8.331e-04 1.533e-04 -5.433 5.55e-08 ***
clastic_volcanic -1.396e-01 8.209e-01 -0.170 0.86495
coarse_sedimentary -2.720e+00 8.634e-01 -3.150 0.00163 **
felsic_intrusive -3.862e-01 8.404e-01 -0.460 0.64583
fine_sedimentary -1.010e+00 8.795e-01 -1.149 0.25069
rhyolite -8.420e-01 8.531e-01 -0.987 0.32365
tuff 1.569e+01 1.008e+03 0.016 0.98758
# without intercept:
Estimate Std. Error z value Pr(>|z|)
bd_sum -8.331e-04 1.533e-04 -5.433 5.55e-08 ***
andesite 7.296e+00 1.404e+00 5.196 2.03e-07 ***
clastic_volcanic 7.156e+00 1.276e+00 5.608 2.05e-08 ***
coarse_sedimentary 4.576e+00 1.158e+00 3.951 7.77e-05 ***
felsic_intrusive 6.910e+00 1.252e+00 5.520 3.39e-08 ***
fine_sedimentary 6.286e+00 1.279e+00 4.916 8.83e-07 ***
rhyolite 6.454e+00 1.289e+00 5.005 5.57e-07 ***
tuff 2.299e+01 1.008e+03 0.023 0.982
... the meaning of the coefficients now makes sense (I think!), however the
significance of each level of the 'stratifying' factor has changed
considerably. How can I interpret that change?
Thanks,
Dylan
> -p
>
> > Much of what I am trying to get across is that you have a lot of choice
> > as to how you specify a model to R. There has to be a default, which is
> > chosen because it is often a good choice. It does rely on factors being
> > coded well: the 'base level' (to quote ?contr.treatment) needs to be
> > interpretable. And also bear in mind that the default choices of
> > statistical software in this area almost all differ (and R's differs from
> > S, GLIM, some ways to do this in SAS ...), so people's ideas of a 'good
> > choice' do differ.
> >
> >> Setting up the model, minus the intercept term, gives me what appear to
> >> be more meaningful coefficients. However, I am not sure if I am
> >> interpreting the results from a linear model without an intercept term.
> >> Model predictions from both specifications (with and without an
> >> intercept term) are nearly identical (different by about 1E-16 in
> >> probability space).
> >>
> >> Are there any gotchas to look out for when removing the intercept term
> >> from such a model?
> >
> > It is just a different parametrization of the linear predictor.
> > Anything interpretable in terms of the predictions of the model will be
> > unchanged. That is the crux: the default coefficients of 'b' will be
> > log odds-ratios that are directly interpretable, and those in the
> > per-group coding will be log-odds for a zero value of 'a'. Does a zero
> > value of 'a' make sense?
> >
> >> Any guidance would be greatly appreciated.
> >>
> >> Cheers,
--
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341
More information about the R-help
mailing list