[R] Saturated model in binomial glm

Giovanni Petris gpetris at uark.edu
Thu Feb 17 16:03:37 CET 2011


Dear Bill,

Thank you very much for your careful discussion of the issue.

It is not surprising that the deviance is the same whether you fit the
model using a factor response with weights or individual 0/1 responses.
I think this happens because the fitted probabilities in the saturated
models are zero or one (depending on the response being a failure or a
success) in both cases. The degrees of freedom are clearly different.

However, if you consider as response the two-column matrix of successes
and failures then you have a different deviance, with different degrees
of freedom:

> covariates <- subset(UCBAd, subset = Admit == "Admitted",
+                      select = -c(Admit, Freq))
> AdmitReject <- as.matrix(cbind(subset(UCBAd, subset = Admit == "Rejected",
+                                       select = Freq),
+                                subset(UCBAd, subset = Admit == "Admitted",
+                                       select = Freq)))
> UCBAd_matrix <- glm(AdmitReject ~ Gender + Dept, family = binomial,
+                     data = covariates) ## matrix as response
> UCBAd_matrix$deviance
[1] 20.20428
> UCBAd_matrix$df.residual
[1] 5
> coef(UCBAd_matrix)
 (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE 
 -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574 
       DeptF 
  3.30648006 

In this case the fitted probabilities in the saturated model are the
observed proportions. The fitted coefficients are clearly the same.

The deviance (and degrees of freedom) resulting from this way of fitting
the model _can_ be used as a goodness-of-fit statistic, based on
standard arguments, while the deviance (and degrees of freedom)
resulting from fitting the model as individual 0/1 responses _can't_.

What about deviance and degrees of freedom that I get using the "factor
with weights" approach? The number of parameters in the saturated model
does not increase with the number of observations, as it happens in the
individual 0/1 responses case, but the fitted probabilities in the
saturated model are fixed to zero and one. I guess it is not clear to me
that the standard likelihood ratio test asymptotic theory holds in this
case. Could anybody clarify?

Thank you in advance,
Giovanni


On Thu, 2011-02-17 at 10:50 +1100, Bill.Venables at csiro.au wrote:
> This is a very good question.  You have spotted something that not
> many people see and it is important.
> 
> The bland assertion that the "deviance can be used as a test of fit"
> can be seriously misleading.
> 
> For this data the response is clearly binary, "Admitted" (success) or
> "Rejected" (failure) and the other two factors are explanatory
> variables.  
> 
> Any binomial model can be fitted by expressing the data as a binary
> response.  In this case there are
> 
> > sum(UCBAd$Freq)
> [1] 4526
> > 
> 
> 4526 trials, corresponding to the individual applicants for admission.
> We can expand the data frame right out to this level and fit the model
> with the data in that form, and in this case the weights will be the
> default, ie. all 1's.
> 
> We can also *group* the data into subsets which are homogeneous with
> respect to the explanatory variables.  
> 
> The most succinct grouping would be into 12 groups corresponding to
> the 2 x 6 distinct classes defined by the two explanatory variables.
> In this case you would specify the response either as a two-column
> matrix of successes/failures, or as a proportion with the totals for
> each of the 12 cases as weights.
> 
> Another succince grouping is into 2 x 2 x 6 classes as you do in your
> example.  In this case your response is the factor and the weights are
> the frequencies.
> 
> For all three cases a) the estimates will be the same, and so the
> predictions will be identical, b) the deviance will also be the same,
> but c) the degrees of freedom attributed to the deviance will be
> different.
> 
> The reason for c) is, as you have intuited, the saturated model is
> different.  Literally, the saturated model is a model with one mean
> parameter for each value taken as an observation when the model is
> fitted.  So the saturated model is *not* invariant with respect to
> grouping.
> 
> Let's look at two of these cases computationally:
> 
> 
> > UCB_Expanded <- UCBAd[rep(1:24, UCBAd$Freq), 1:3] ## expand the data frame
> > dim(UCB_Expanded)
> [1] 4526    3
> 
> >  ### now fit your model
> 
> > m1 <- glm(Admit ~ Gender + Dept, binomial, UCBAd, weights = Freq)
> 
> >  ### and the same model using the binary data form
> 
> > m2 <- glm(Admit ~ Gender + Dept, binomial, UCB_Expanded)
> 
> >  ### as predicted, the coefficients are identical (up to round off)
> 
> > coef(m1)
>  (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE 
>  -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574 
>        DeptF 
>   3.30648006 
> > coef(m2)
>  (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE 
>  -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574 
>        DeptF 
>   3.30648006 
> 
> >   ### and so are the deviances, perhaps surprisingly:
> 
> > deviance(m1)
> [1] 5187.488
> > deviance(m2)
> [1] 5187.488
> 
> >   ### but the degrees of freedom attributed to the deviance are different!
> 
> > m1$df.resid
> [1] 17
> > m2$df.resid
> [1] 4519
> > 
>  
> If you were to fit the model in the most succinct form, with 12
> relative frequencies, then you would get the same deviance again, but
> the degrees of freedom would be only 5.
> 
> So you need to be very careful in taking the deviance, even in
> binomial models, as a test of fit.  The way the data are grouped is
> relevant.
> 
> If you have two fixed models, e.g. Admit ~ Gender, and Admit ~ Gender + Dept, then
> 
> The estimated coefficients, and their standard errors, vcov matrix,
> The deviance, and so
> *Differences* in deviances and
> *Differences* in degrees of freedom
> 
> will be the same however the data are grouped, and so the usual tests
> and CI processes go ahead fine.
> 
> But the deviance itself can be misleading as a test of fit, since the
> outer hypothesis, the saturated model, is not fixed and depends on
> grouping.  For the ungrouped binary case it is *usually* misleading
> when taken simply at face value as chi-squared distributed under the
> null hypothesis.
> 
> I think there is a book or two around that discusses this issue, but
> probably not well enough.
> 
> Bill Venables.
> 
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Giovanni Petris
> Sent: Thursday, 17 February 2011 7:58 AM
> To: r-help at r-project.org
> Subject: [R] Saturated model in binomial glm
> 
> 
> Hi all,
> 
> Could somebody be so kind to explain to me what is the saturated model
> on which deviance and degrees of freedom are calculated when fitting a
> binomial glm?
> 
> Everything makes sense if I fit the model using as response a vector of
> proportions or a two-column matrix. But when the response is a factor
> and counts are specified via the "weights" argument, I am kind of lost
> as far as what is the implied saturated model. 
> 
> Here is a simple example, based on the UCBAdmissions data.
> 
> > UCBAd <- as.data.frame(UCBAdmissions)
> > UCBAd <- glm(Admit ~ Gender + Dept, family = binomial,
> +                         weights = Freq, data = UCBAd)
> > UCBAd$deviance
> [1] 5187.488
> > UCBAd$df.residual
> [1] 17
> 
> I can see that the data frame UCBAd has 24 rows and using 1+1+5
> parameters to fit the model leaves me with 17 degrees of freedom. 
> 
> What is not clear to me is what is the saturated model? 
> 
> Is it the model that fits a probability zero to each row corresponding
> to failures and a probability one to each row corresponding to
> successes? If this is so, it seems to me that looking at the deviance as
> a goodness-of-fit statistic does not make much sense in this case. Am I
> missing something?
> 
> Thank you in advance,
> Giovanni
> 
>



More information about the R-help mailing list