[R] logistic regression model specification
Dylan Beaudette
dylan.beaudette at gmail.com
Wed Nov 14 18:34:28 CET 2007
On Nov 14, 2007 5:17 AM, Peter Dalgaard <P.Dalgaard at biostat.ku.dk> wrote:
>
> Dylan Beaudette wrote:
> > On Tuesday 13 November 2007, Peter Dalgaard wrote:
> >
> >> Prof Brian Ripley wrote:
> >>
> >>> On Tue, 13 Nov 2007, Dylan Beaudette wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I have setup a simple logistic regression model with the glm() function,
> >>>> with the follow formula:
> >>>>
> >>>> y ~ a + b
> >>>>
> >>>> where:
> >>>> 'a' is a continuous variable stratified by
> >>>> the levels of 'b'
> >>>>
> >>>>
> >>>> Looking over the manual for model specification, it seems that
> >>>> coefficients for unordered factors are given 'against' the first level
> >>>> of that factor.
> >>>>
> >>> Only for the default coding.
> >>>
> >>>
> >>>> This makes for difficult interpretation when using factor 'b' as a
> >>>> stratifying model term.
> >>>>
> >>> Really? You realize that you have not 'stratified' on 'b', which would
> >>> need the model to be a*b? What you have is a model with parallel linear
> >>> predictors for different levels of 'b', and if the coefficients are not
> >>> telling you what you want you should change the coding.
> >>>
> >
> > Thanks for the comments, Peter. Note that any misuse of jargon on my
> > part reflects my limited understanding.
> >
> >
> >> I have to differ slightly here. "Stratification", at least in the fields
> >> that I connect with, usually means that you combine information from
> >> several groups via an assumption that they have a common value of a
> >> parameter, which in the present case is essentially the same as assuming
> >> an additive model y~a+b.
> >>
> >
> > This was the definition of 'stratification' that I was using when formulating
> > this model.
> >
> >
> >> I share your confusion as to why the parametrization of the effects of
> >> factor b should matter, though. Surely, the original poster has already
> >> noticed that the estimated effect of a is the same whether or not the
> >> intercept is included? The only difference I see is that the running
> >> anova() or drop1() would not give interesting results for the effect of
> >> b in the no-intercept variation.
> >>
> >
> > Yes, I have noticed that the estimated effect is the same. It looks like I am
> > having trouble interpreting the model formula and coefficients. An example
> > from MASS, using the default R contrasts:
> >
> > library(MASS)
> > data(whiteside)
> >
> > # simple additive model, relative to first level of 'Insul'
> > # 'parallel regressions'
> > summary(lm(Gas ~ Temp + Insul, data=whiteside))
> >
> > Estimate Std. Error t value Pr(>|t|)
> > (Intercept) 6.55133 0.11809 55.48 <2e-16 ***
> > Temp -0.33670 0.01776 -18.95 <2e-16 ***
> > InsulAfter -1.56520 0.09705 -16.13 <2e-16 ***
> >
> >
> > # same model without the intercept
> > summary(lm(Gas ~ Temp + Insul -1 , data=whiteside))
> >
> > Estimate Std. Error t value Pr(>|t|)
> > Temp -0.33670 0.01776 -18.95 <2e-16 ***
> > InsulBefore 6.55133 0.11809 55.48 <2e-16 ***
> > InsulAfter 4.98612 0.10268 48.56 <2e-16 ***
> >
> > In presenting the terms of this model, I would like to be able to talk about
> > the physical meaning of the coefficients. Without the intercept term, the
> > second example gives the group intercepts before and after insulation
> > (the slope on Temp is common to both). I now understand how to interpret
> > the first example, as:
> >
> > Intercept + InsulAfter = intercept for the 'After' group
> > 6.55133 + (-1.56520) = 4.98612 (up to rounding)
> >
> > ... the meaning of the coefficient for InsulAfter is relative to that of the
> > Intercept (treatment-style contrasts).
> >
> > With the above example the significance tests do not really change when the
> > intercept is removed. Within the context of my data, removing the intercept
> > has the following effect:
> >
> > # with intercept
> > Estimate Std. Error z value Pr(>|z|)
> > (Intercept) 7.296e+00 1.404e+00 5.196 2.03e-07 ***
> > bd_sum -8.331e-04 1.533e-04 -5.433 5.55e-08 ***
> > clastic_volcanic -1.396e-01 8.209e-01 -0.170 0.86495
> > coarse_sedimentary -2.720e+00 8.634e-01 -3.150 0.00163 **
> > felsic_intrusive -3.862e-01 8.404e-01 -0.460 0.64583
> > fine_sedimentary -1.010e+00 8.795e-01 -1.149 0.25069
> > rhyolite -8.420e-01 8.531e-01 -0.987 0.32365
> > tuff 1.569e+01 1.008e+03 0.016 0.98758
> >
> > # without intercept:
> > Estimate Std. Error z value Pr(>|z|)
> > bd_sum -8.331e-04 1.533e-04 -5.433 5.55e-08 ***
> > andesite 7.296e+00 1.404e+00 5.196 2.03e-07 ***
> > clastic_volcanic 7.156e+00 1.276e+00 5.608 2.05e-08 ***
> > coarse_sedimentary 4.576e+00 1.158e+00 3.951 7.77e-05 ***
> > felsic_intrusive 6.910e+00 1.252e+00 5.520 3.39e-08 ***
> > fine_sedimentary 6.286e+00 1.279e+00 4.916 8.83e-07 ***
> > rhyolite 6.454e+00 1.289e+00 5.005 5.57e-07 ***
> > tuff 2.299e+01 1.008e+03 0.023 0.982
> >
> >
> > ... the meaning of the coefficients now makes sense (I think!), however the
> > significance of each level of the 'stratifying' factor has changed
> > considerably. How can I interpret that change?
> >
> >
> >
Peter:
> (This is not raw R output, is it? The name of the stratifying factor
> seems to be missing.)
>
Sorry about that -- it was copied directly from R, but I removed the
stratifying-factor prefix for brevity.
> If you add the Intercept to each term in the first analysis, as well as
> to zero for andesite, you get the numbers in the second analysis, or
> conversely subtract andesite from each term in in the 2nd analysis and
> get those in the first.
This is now clear to me. It took writing out both forms in the last
email to fully realize it. (I am a soil scientist by trade... trying
to catch up on stats.)
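Peter's arithmetic can be checked directly with the whiteside example
quoted above; this sketch just refits both parametrizations and compares
the coefficients:

```r
library(MASS)
data(whiteside)

# default (treatment-contrast) parametrization
fit1 <- lm(Gas ~ Temp + Insul, data = whiteside)
# per-group-intercept parametrization (no overall intercept)
fit2 <- lm(Gas ~ Temp + Insul - 1, data = whiteside)

# (Intercept) + InsulAfter from fit1 equals InsulAfter from fit2,
# and the common slope on Temp is identical in both fits
coef(fit1)["(Intercept)"] + coef(fit1)["InsulAfter"]
coef(fit2)["InsulAfter"]
```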
> In the first display, you can immediately see that (the intercept of)
> the line for coarse_sedimentary is significantly lower than that for
> andesite. Since the regression coefficient on bd_sum is assumed the
> same for all strata, the difference in intercepts is actually the
> difference for any fixed value of bd_sum. The other factor levels are
> not significantly different from andesite.
OK -- this was what I was really struggling with here: interpreting the
marginal t-tests. I was interpreting them as tests of the significance
of the levels of my stratifying factor as predictors of my response. I
will read up on the output from summary.glm some more...
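As Prof Ripley suggested, which comparisons those marginal tests report
depends only on the coding; relevel() swaps the base level without
changing the model. A sketch with the whiteside data (the variable names
are from that example, not the geology model):

```r
library(MASS)
data(whiteside)

fit.a <- lm(Gas ~ Temp + Insul, data = whiteside)

# use 'After' as the reference level instead of the default 'Before'
whiteside$Insul2 <- relevel(whiteside$Insul, ref = "After")
fit.b <- lm(Gas ~ Temp + Insul2, data = whiteside)

# the Temp coefficient and all fitted values are unchanged;
# only the reported factor contrast flips sign
coef(fit.a)
coef(fit.b)
```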
> In the second display, the significance tests are for whether the
> intercept for each per-stratum line can be zero. I.e. whether the value
> of the linear predictor is zero when bd_sum is zero, which in turn,
> since this is a logistic model, implies that the predicted probability
> is 0.5. This is probably not a relevant hypothesis at all, but all
> intercepts except the last one are significantly non-zero. "Tuff"
> appears to be anomalous; it has a large coefficient and a huge s.e. --
> presumably the response is all 1's in this group?
Yikes. You are correct, this is not a hypothesis worth testing in this
case. The 'tuff' case is as you suggested -- all response observations
in that group were 1's.
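Peter's diagnosis (an all-1's group) is easy to reproduce with made-up
data -- the level names here are hypothetical; the point is the huge
standard error that separation produces, because the MLE for that
group's intercept is infinite:

```r
# toy data: level B has an all-1 response, like 'tuff' above
set.seed(1)
geo <- factor(rep(c("A", "B"), each = 30))
y   <- c(rbinom(30, 1, 0.5),  # mixed 0/1 responses in level A
         rep(1, 30))          # all 1's in level B

fit <- glm(y ~ geo, family = binomial())
summary(fit)$coefficients     # geoB: large estimate, enormous s.e.
```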
Some follow-up. Say we are using the same model:

logit(y) ~ temp + geology

The logit of the response is driven by temperature and by the underlying
geologic type. The influence of temperature should be the same
everywhere, but the geologic type alters several other physical
parameters which cannot be directly modeled. With the above advice I
think that I can now interpret the marginal t-tests. However, I am not
really interested in how a given geologic type's coefficient differs
from that of the first level of the factor. Would another model form
help me with that?
# regression on each level of stratifying factor
# explicit removal of the intercept as suggested in MASS, ch. 6
logit(y) ~ geology/temp - 1
... the logit of the response is again driven by temperature, but now
both the slope and the intercept differ across the levels of geology.
The output from that model gives (I think) an intercept and a slope for
each level of my stratifying factor -- a separate regression on temp for
every level of geology (?).
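A sketch of this nested form with the whiteside data (a Gaussian lm
rather than a glm, but the parametrization works the same way): one
intercept and one slope per factor level, as in MASS ch. 6.

```r
library(MASS)
data(whiteside)

fit <- lm(Gas ~ Insul/Temp - 1, data = whiteside)
coef(fit)
# InsulBefore, InsulAfter           : per-group intercepts
# InsulBefore:Temp, InsulAfter:Temp : per-group slopes
```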
Interpretation of the model parameters would potentially be more
meaningful in my context. Here is the output, this time without any
deletions:
glm(formula = mollic ~ geo_class2/bd_sum - 1, family = binomial(),
data = xy[-crit.points.idx, ], subset = parent_material !=
"ALL" & parent_material != "OBD" & dem_slope >= slope_cutoff)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9818343 -0.7562096 0.0003575 0.8130413 2.1977756
Coefficients:
Estimate Std. Error z value Pr(>|z|)
geo_class2andesite 1.146e+01 6.920e+00 1.656 0.09781 .
geo_class2clastic_volcanic 3.841e+00 2.233e+00 1.720 0.08539 .
geo_class2coarse_sedimentary 4.371e+00 2.390e+00 1.829 0.06743 .
geo_class2felsic_intrusive 5.384e+00 2.714e+00 1.984 0.04725 *
geo_class2fine_sedimentary 8.527e+00 4.013e+00 2.125 0.03361 *
geo_class2rhyolite 1.075e+01 4.071e+00 2.641 0.00828 **
geo_class2tuff 1.657e+01 9.116e+03 0.002 0.99855
geo_class2andesite:bd_sum -1.356e-03 8.448e-04 -1.605 0.10858
geo_class2clastic_volcanic:bd_sum -4.138e-04 2.831e-04 -1.462 0.14383
geo_class2coarse_sedimentary:bd_sum -8.041e-04 3.338e-04 -2.409 0.01601 *
geo_class2felsic_intrusive:bd_sum -6.352e-04 3.503e-04 -1.813 0.06977 .
geo_class2fine_sedimentary:bd_sum -1.122e-03 5.139e-04 -2.184 0.02897 *
geo_class2rhyolite:bd_sum -1.361e-03 4.873e-04 -2.793 0.00522 **
geo_class2tuff:bd_sum 8.793e-15 1.200e+00 7.33e-15 1.00000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 274.49 on 198 degrees of freedom
Residual deviance: 190.03 on 184 degrees of freedom
AIC: 218.03
... how would I interpret the marginal tests (the z-tests in the output
above) in this case? Would each be a test of how well the corresponding
intercept/slope pair predicts my response?
A quick comparison of the two models (again, as suggested in MASS)
with the anova() function:
Analysis of Deviance Table
Model 1: mollic ~ bd_sum + geo_class2
Model 2: mollic ~ geo_class2/bd_sum - 1
Resid. Df Resid. Dev Df Deviance
1 190 194.801
2 184 190.033 6 4.769
... the deviance drops by only 4.769 on 6 additional degrees of freedom
(p ~ 0.57), so the second model does not appear to fit any better than
the first -- is this the correct interpretation? The fitted values are
also very close, but not identical.
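For reference, the deviance change in the table above can be turned into
a p-value with pchisq(); the sketch below also runs the analogous model
comparison on whiteside (an lm, so anova() reports an F test there):

```r
library(MASS)
data(whiteside)

# common slope vs. a separate slope per group (nested form)
fit.common   <- lm(Gas ~ Temp + Insul,   data = whiteside)
fit.separate <- lm(Gas ~ Insul/Temp - 1, data = whiteside)
anova(fit.common, fit.separate)

# for the glm comparison quoted above: a deviance change of 4.769 on
# 6 df is far from significant
1 - pchisq(4.769, 6)
</imports>
```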
Thanks again,
Dylan
>
> > Thanks,
> >
> > Dylan
> >
> >
> >
> >> -p
> >>
> >>
> >>> Much of what I am trying to get across is that you have a lot of choice
> >>> as to how you specify a model to R. There has to be a default, which is
> >>> chosen because it is often a good choice. It does rely on factors being
> >>> coded well: the 'base level' (to quote ?contr.treatment) needs to be
> >>> interpretable. And also bear in mind that the default choices of
> >>> statistical software in this area almost all differ (and R's differs from
> >>> S, GLIM, some ways to do this in SAS ...), so people's ideas of a 'good
> >>> choice' do differ.
> >>>
> >>>
> >>>> Setting up the model without the intercept term gives me what appear to
> >>>> be more meaningful coefficients. However, I am not sure whether I am
> >>>> interpreting the results from a linear model without an intercept term
> >>>> correctly. Model predictions from both specifications (with and without
> >>>> an intercept term) are nearly identical (differing by about 1e-16 in
> >>>> probability space).
> >>>>
> >>>> Are there any gotchas to look out for when removing the intercept term
> >>>> from such a model?
> >>>>
> >>> It is just a different parametrization of the linear predictor.
> >>> Anything interpretable in terms of the predictions of the model will be
> >>> unchanged. That is the crux: the default coefficients of 'b' will be
> >>> log odds-ratios that are directly interpretable, and those in the
> >>> per-group coding will be log-odds for a zero value of 'a'. Does a zero
> >>> value of 'a' make sense?
> >>>
> >>>
> >>>> Any guidance would be greatly appreciated.
> >>>>
> >>>> Cheers,
> >>>>
> >
> >
> >
> >
>
>
> --
>
> O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
> c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
> (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
>
>
>