[R] logistic regression model specification

Peter Dalgaard p.dalgaard at biostat.ku.dk
Wed Nov 14 00:47:18 CET 2007


Prof Brian Ripley wrote:
> On Tue, 13 Nov 2007, Dylan Beaudette wrote:
>
>   
>> Hi,
>>
>> I have set up a simple logistic regression model with the glm() function, with
>> the following formula:
>>
>> y ~ a + b
>>
>> where:
>> 'a' is a continuous variable stratified by
>> the levels of 'b'
>>
>>
>> Looking over the manual for model specification, it seems that coefficients
>> for unordered factors are given 'against' the first level of that factor.
>>     
>
> Only for the default coding.
>
>   
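
[Illustrative sketch of what the default coding does, on simulated data; the data
frame 'd', the level names and the coefficients below are made up for the example:]

set.seed(1)
d <- data.frame(a = rnorm(300),
                b = factor(rep(c("b1", "b2", "b3"), each = 100)))
d$y <- rbinom(300, 1, plogis(-0.5 + 1.2 * d$a + c(0, 0.8, 1.5)[d$b]))

fit <- glm(y ~ a + b, family = binomial, data = d)
coef(fit)           ## 'bb2' and 'bb3' are contrasts against the first level, 'b1'
contr.treatment(3)  ## the default (treatment) coding matrix behind those contrasts
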
>> This makes for difficult interpretation when using factor 'b' as a
>> stratifying model term.
>>     
>
> Really?  You realize that you have not 'stratified' on 'b', which would 
> need the model to be a*b?  What you have is a model with parallel linear 
> predictors for different levels of 'b', and if the coefficients are not 
> telling you what you want, you should change the coding.
>
>   
I have to differ slightly here. "Stratification", at least in the fields 
that I connect with, usually means that you combine information from 
several groups via an assumption that they have a common value of a 
parameter, which in the present case is essentially the same as assuming 
an additive model y~a+b.
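
[A minimal sketch of that distinction, reusing the simulated data frame 'd' from
the sketch above:]

## additive model: a common slope for 'a', separate intercepts per level of 'b'
fit_add <- glm(y ~ a + b, family = binomial, data = d)
## 'stratified' in the sense above: a separate slope for 'a' within each level of 'b'
fit_int <- glm(y ~ a * b, family = binomial, data = d)
anova(fit_add, fit_int, test = "Chisq")  ## is the common-slope assumption tenable?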

I share your confusion as to why the parametrization of the effects of 
factor b should matter, though. Surely, the original poster has already 
noticed that the estimated effect of a is the same whether or not the 
intercept is included? The only difference I see is that running anova() 
or drop1() would not give interesting results for the effect of b in the 
no-intercept variant.
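
[Sketch of this point, again with the simulated 'd' from above:]

fit1 <- glm(y ~ a + b,     family = binomial, data = d)  ## default parametrization
fit2 <- glm(y ~ a + b - 1, family = binomial, data = d)  ## one intercept per level of 'b'
coef(fit1)["a"] - coef(fit2)["a"]  ## essentially zero: the slope for 'a' is unchanged
drop1(fit2, test = "Chisq")        ## the 'b' line now tests whether all per-level
                                   ## intercepts are zero, not whether they are equal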

    -p

> Much of what I am trying to get across is that you have a lot of choice as 
> to how you specify a model to R. There has to be a default, which is 
> chosen because it is often a good choice.  It does rely on factors being 
> coded well: the 'base level' (to quote ?contr.treatment) needs to be 
> interpretable.  And also bear in mind that the default choices of 
> statistical software in this area almost all differ (and R's differs from 
> S, GLIM, some ways to do this in SAS ...), so people's ideas of a 'good 
> choice' do differ.
>
>   
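
[Sketch: choosing an interpretable base level, or another coding altogether, with
the simulated 'd' from above:]

d$b <- relevel(d$b, ref = "b3")  ## make 'b3' the base level the contrasts refer to
coef(glm(y ~ a + b, family = binomial, data = d))
## or change the default coding globally, e.g. to sum-to-zero contrasts:
## options(contrasts = c("contr.sum", "contr.poly"))
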
>> Setting up the model, minus the intercept term, gives me what appear to be
>> more meaningful coefficients. However, I am not sure if I am correctly interpreting the
>> results from a linear model without an intercept term. Model predictions from
>> both specifications (with and without an intercept term) are nearly identical
>> (different by about 1E-16 in probability space).
>>
>> Are there any gotchas to look out for when removing the intercept term from
>> such a model?
>>     
>
> It is just a different parametrization of the linear predictor. 
> Anything interpretable in terms of the predictions of the model will be 
> unchanged.  That is the crux: the default coefficients of 'b' will be 
> log odds-ratios that are directly interpretable, and those in the 
> per-group coding will be log-odds for a zero value of 'a'. Does a zero 
> value of 'a' make sense?
>
>   
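
[Sketch of the two interpretations, and of centring 'a' when a zero value of 'a'
is not meaningful; simulated 'd' as above:]

fit_default <- glm(y ~ a + b,     family = binomial, data = d)
fit_cells   <- glm(y ~ a + b - 1, family = binomial, data = d)
coef(fit_default)  ## the 'b' coefficients are log odds-ratios against the base level
coef(fit_cells)    ## the 'b' coefficients are log-odds for each level at a == 0
## if a == 0 is not meaningful, centre 'a' so those log-odds refer to the mean of 'a':
d$a_c <- d$a - mean(d$a)
coef(glm(y ~ a_c + b - 1, family = binomial, data = d))
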
>> Any guidance would be greatly appreciated.
>>
>> Cheers,
>>
>>
>>     
>
>   


-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907


