[R] logistic regression model specification

Dylan Beaudette dylan.beaudette at gmail.com
Wed Nov 14 01:17:45 CET 2007


On Tuesday 13 November 2007, Peter Dalgaard wrote:
> Prof Brian Ripley wrote:
> > On Tue, 13 Nov 2007, Dylan Beaudette wrote:
> >> Hi,
> >>
> >> I have set up a simple logistic regression model with the glm() function,
> >> with the following formula:
> >>
> >> y ~ a + b
> >>
> >> where:
> >> 'a' is a continuous variable stratified by
> >> the levels of 'b'
> >>
> >>
> >> Looking over the manual for model specification, it seems that
> >> coefficients for unordered factors are given 'against' the first level
> >> of that factor.
> >
> > Only for the default coding.
> >
> >> This makes for difficult interpretation when using factor 'b' as a
> >> stratifying model term.
> >
> > Really?  You realize that you have not 'stratified' on 'b', which would
> > need the model to be a*b?  What you have is a model with parallel linear
> > predictors for different levels of 'b', and if the coefficients are not
> > telling you what you want you should change the coding.

Thanks for the comments, Peter. Note that any use/abuse of jargon on my part is 
compounded by my limited understanding.

> I have to differ slightly here. "Stratification", at least in the fields
> that I connect with, usually means that you combine information from
> several groups via an assumption that they have a common value of a
> parameter, which in the present case is essentially the same as assuming
> an additive model y~a+b.

This was the definition of 'stratification' that I was using when formulating 
this model.
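
For concreteness, here is how I read the distinction, sketched on simulated 
data (not my real data; 'a' continuous, 'b' a two-level factor):

# additive model: a common slope for 'a', with a separate intercept per level of 'b'
# interaction model: a separate slope and intercept per level of 'b' (the a*b above)
set.seed(1)
d <- data.frame(a = rnorm(200), b = factor(rep(c("low", "high"), each = 100)))
d$y <- rbinom(200, 1, plogis(1 + 0.5 * d$a - 1.0 * (d$b == "low")))

m.additive <- glm(y ~ a + b, family = binomial, data = d)
m.interact <- glm(y ~ a * b, family = binomial, data = d)

coef(m.additive)  # one slope for 'a', plus a contrast between the levels of 'b'
coef(m.interact)  # the slope for 'a' is allowed to differ by level of 'b'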

> I share your confusion as to why the parametrization of the effects of
> factor b should matter, though. Surely, the original poster has already
> noticed that the estimated effect of a is the same whether or not the
> intercept is included? The only difference I see is that running
> anova() or drop1() would not give interesting results for the effect of
> b in the no-intercept variation.

Yes, I have noticed that the estimated effect is the same. It looks like I am 
having trouble interpreting the model formula and coefficients. An example 
from MASS, using the default R contrasts:

library(MASS)
data(whiteside)

# simple additive model, relative to first level of 'Insul'
# 'parallel regressions'
summary(lm(Gas ~ Temp + Insul, data=whiteside))

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.55133    0.11809   55.48   <2e-16 ***
Temp        -0.33670    0.01776  -18.95   <2e-16 ***
InsulAfter  -1.56520    0.09705  -16.13   <2e-16 ***


# same model without the intercept
summary(lm(Gas ~ Temp + Insul - 1, data=whiteside))

            Estimate Std. Error t value Pr(>|t|)    
Temp        -0.33670    0.01776  -18.95   <2e-16 ***
InsulBefore  6.55133    0.11809   55.48   <2e-16 ***
InsulAfter   4.98612    0.10268   48.56   <2e-16 ***
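
To convince myself that these are the same model (and to see Peter's point 
about anova()/drop1()), a quick sketch re-using the fits above:

# the two parametrizations of the same additive model
m.trt  <- lm(Gas ~ Temp + Insul, data = whiteside)      # treatment contrasts
m.cell <- lm(Gas ~ Temp + Insul - 1, data = whiteside)  # per-group intercepts

# the per-group intercepts can be recovered from the treatment-contrast fit
coef(m.trt)["(Intercept)"]                              # 'Before' intercept
coef(m.trt)["(Intercept)"] + coef(m.trt)["InsulAfter"]  # 'After' intercept

# the fitted values (and hence predictions) are identical
all.equal(fitted(m.trt), fitted(m.cell))

# dropping Insul from the no-intercept fit is not an interesting comparison
drop1(m.trt, test = "F")
drop1(m.cell, test = "F")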

In presenting the terms of this model, I would like to be able to talk about 
the physical meaning of the coefficients. Without the intercept term, the 
second example reports the group-specific intercepts for the 'Before' and 
'After' periods directly (the common slope on Temp is unchanged). I now 
understand how to interpret the first example, as:

(Intercept) + InsulAfter  =  intercept for the 'After' group
6.55133 + (-1.56520)      =  4.98613, i.e. the 4.98612 reported above, up to rounding

... the meaning of the coefficient for InsulAfter is relative to that of the 
Intercept (treatment-style contrasts).
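
Following the suggestion above to 'change the coding', I gather the base level 
can also be changed without dropping the intercept, e.g. (a sketch, again with 
the whiteside data):

# report the coefficients against a different base level via relevel()
whiteside$Insul2 <- relevel(whiteside$Insul, ref = "After")
summary(lm(Gas ~ Temp + Insul2, data = whiteside))

# or request sum-to-zero contrasts instead of treatment contrasts
summary(lm(Gas ~ Temp + Insul, data = whiteside, contrasts = list(Insul = "contr.sum")))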

With the above example, the significance tests do not really change when the 
intercept is removed. Within the context of my data, removing the intercept 
has the following effect:

# with intercept
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         7.296e+00  1.404e+00   5.196 2.03e-07 ***
bd_sum             -8.331e-04  1.533e-04  -5.433 5.55e-08 ***
clastic_volcanic   -1.396e-01  8.209e-01  -0.170  0.86495    
coarse_sedimentary -2.720e+00  8.634e-01  -3.150  0.00163 ** 
felsic_intrusive   -3.862e-01  8.404e-01  -0.460  0.64583    
fine_sedimentary   -1.010e+00  8.795e-01  -1.149  0.25069    
rhyolite           -8.420e-01  8.531e-01  -0.987  0.32365    
tuff                1.569e+01  1.008e+03   0.016  0.98758

# without intercept:
                     Estimate Std. Error z value Pr(>|z|)    
bd_sum             -8.331e-04  1.533e-04  -5.433 5.55e-08 ***
andesite            7.296e+00  1.404e+00   5.196 2.03e-07 ***
clastic_volcanic    7.156e+00  1.276e+00   5.608 2.05e-08 ***
coarse_sedimentary  4.576e+00  1.158e+00   3.951 7.77e-05 ***
felsic_intrusive    6.910e+00  1.252e+00   5.520 3.39e-08 ***
fine_sedimentary    6.286e+00  1.279e+00   4.916 8.83e-07 ***
rhyolite            6.454e+00  1.289e+00   5.005 5.57e-07 ***
tuff                2.299e+01  1.008e+03   0.023    0.982


... the meaning of the coefficients now makes sense (I think!); however, the 
significance of each level of the 'stratifying' factor has changed 
considerably. How can I interpret that change?
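
For what it's worth, here is how I am currently reading the two sets of tests, 
sketched on simulated data (none of my real data); corrections welcome:

# with the intercept, each level's z test asks: does this level differ from the base level?
# without it, each z test asks: is this level's log-odds at x = 0 (bd_sum = 0 in
# my model) different from zero?
set.seed(2)
g <- factor(sample(c("g1", "g2", "g3"), 300, replace = TRUE))
x <- rnorm(300, mean = 5)
y <- rbinom(300, 1, plogis(2 - 0.4 * x + c(g1 = 0, g2 = 0.2, g3 = -1.5)[as.character(g)]))

fit.trt  <- glm(y ~ x + g, family = binomial)      # treatment contrasts
fit.cell <- glm(y ~ x + g - 1, family = binomial)  # per-group intercepts

summary(fit.trt)$coefficients   # tests of g2 and g3 against g1
summary(fit.cell)$coefficients  # tests of each group's log-odds at x = 0

# a coding-free test of whether 'g' matters at all
drop1(fit.trt, test = "Chisq")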


Thanks,

Dylan


>     -p
>
> > Much of what I am trying to get across is that you have a lot of choice
> > as to how you specify a model to R. There has to be a default, which is
> > chosen because it is often a good choice.  It does rely on factors being
> > coded well: the 'base level' (to quote ?contr.treatment) needs to be
> > interpretable.  And also bear in mind that the default choices of
> > statistical software in this area almost all differ (and R's differs from
> > S, GLIM, some ways to do this in SAS ...), so people's ideas of a 'good
> > choice' do differ.
> >
> >> Setting up the model, minus the intercept term, gives me what appear to
> >> be more meaningful coefficients. However, I am not sure if I am correctly
> >> interpreting the results from a linear model without an intercept term.
> >> Model predictions from both specifications (with and without an
> >> intercept term) are nearly identical (different by about 1E-16 in
> >> probability space).
> >>
> >> Are there any gotchas to look out for when removing the intercept term
> >> from such a model?
> >
> > It is just a different parametrization of the linear predictor.
> > Anything interpretable in terms of the predictions of the model will be
> > unchanged.  That is the crux: the default coefficients of 'b' will be
> > log odds-ratios that are directly interpretable, and those in the
> > per-group coding will be log-odds for a zero value of 'a'. Does a zero
> > value of 'a' make sense?
> >
> >> Any guidance would be greatly appreciated.
> >>
> >> Cheers,



-- 
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341


