[R] Saturated model in binomial glm

Thu Feb 17 00:50:52 CET 2011

This is a very good question.  You have spotted something that not many people see and it is important.

The bland assertion that the "deviance can be used as a test of fit" can be seriously misleading.

For this data the response is clearly binary, "Admitted" (success) or "Rejected" (failure) and the other two factors are explanatory variables.  

Any binomial model can be fitted by expressing the data as a binary response.  In this case there are

> sum(UCBAd$Freq)
[1] 4526
> 

4526 trials, corresponding to the individual applicants for admission.  We can expand the data frame right out to this level and fit the model with the data in that form, and in this case the weights will be the default, ie. all 1's.

We can also *group* the data into subsets which are homogeneous with respect to the explanatory variables.  

The most succinct grouping would be into 12 groups corresponding to the 2 x 6 distinct classes defined by the two explanatory variables.  In this case you would specify the response either as a two-column matrix of successes/failures, or as a proportion with the totals for each of the 12 cases as weights.

Another succince grouping is into 2 x 2 x 6 classes as you do in your example.  In this case your response is the factor and the weights are the frequencies.

For all three cases a) the estimates will be the same, and so the predictions will be identical, b) the deviance will also be the same, but c) the degrees of freedom attributed to the deviance will be different.

The reason for c) is, as you have intuited, the saturated model is different.  Literally, the saturated model is a model with one mean parameter for each value taken as an observation when the model is fitted.  So the saturated model is *not* invariant with respect to grouping.

Let's look at two of these cases computationally:

> UCB_Expanded <- UCBAd[rep(1:24, UCBAd$Freq), 1:3] ## expand the data frame
> dim(UCB_Expanded)
[1] 4526    3

>  ### now fit your model

> m1 <- glm(Admit ~ Gender + Dept, binomial, UCBAd, weights = Freq)

>  ### and the same model using the binary data form

> m2 <- glm(Admit ~ Gender + Dept, binomial, UCB_Expanded)

>  ### as predicted, the coefficients are identical (up to round off)

> coef(m1)
 (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE 
 -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574 
       DeptF 
  3.30648006 
> coef(m2)
 (Intercept) GenderFemale        DeptB        DeptC        DeptD        DeptE 
 -0.58205140  -0.09987009   0.04339793   1.26259802   1.29460647   1.73930574 
       DeptF 
  3.30648006 

>   ### and so are the deviances, perhaps surprisingly:

> deviance(m1)
[1] 5187.488
> deviance(m2)
[1] 5187.488

>   ### but the degrees of freedom attributed to the deviance are different!

> m1$df.resid
[1] 17
> m2$df.resid
[1] 4519
> 

If you were to fit the model in the most succinct form, with 12 relative frequencies, then you would get the same deviance again, but the degrees of freedom would be only 5.

So you need to be very careful in taking the deviance, even in binomial models, as a test of fit.  The way the data are grouped is relevant.

If you have two fixed models, e.g. Admit ~ Gender, and Admit ~ Gender + Dept, then

The estimated coefficients, and their standard errors, vcov matrix,
The deviance, and so
*Differences* in deviances and
*Differences* in degrees of freedom

will be the same however the data are grouped, and so the usual tests and CI processes go ahead fine.

But the deviance itself can be misleading as a test of fit, since the outer hypothesis, the saturated model, is not fixed and depends on grouping.  For the ungrouped binary case it is *usually* misleading when taken simply at face value as chi-squared distributed under the null hypothesis.

I think there is a book or two around that discusses this issue, but probably not well enough.

Bill Venables.

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Giovanni Petris
Sent: Thursday, 17 February 2011 7:58 AM
To: r-help at r-project.org
Subject: [R] Saturated model in binomial glm

Hi all,

Could somebody be so kind to explain to me what is the saturated model
on which deviance and degrees of freedom are calculated when fitting a
binomial glm?

Everything makes sense if I fit the model using as response a vector of
proportions or a two-column matrix. But when the response is a factor
and counts are specified via the "weights" argument, I am kind of lost
as far as what is the implied saturated model. 

Here is a simple example, based on the UCBAdmissions data.

> UCBAd <- as.data.frame(UCBAdmissions)
> UCBAd <- glm(Admit ~ Gender + Dept, family = binomial,
+                         weights = Freq, data = UCBAd)
> UCBAd$deviance
[1] 5187.488
> UCBAd$df.residual
[1] 17

I can see that the data frame UCBAd has 24 rows and using 1+1+5
parameters to fit the model leaves me with 17 degrees of freedom. 

What is not clear to me is what is the saturated model? 

Is it the model that fits a probability zero to each row corresponding
to failures and a probability one to each row corresponding to
successes? If this is so, it seems to me that looking at the deviance as
a goodness-of-fit statistic does not make much sense in this case. Am I
missing something?

Thank you in advance,
Giovanni

-- 

Giovanni Petris  <GPetris at uark.edu>
Associate Professor
Department of Mathematical Sciences
University of Arkansas - Fayetteville, AR 72701
Ph: (479) 575-6324, 575-8630 (fax)
http://definetti.uark.edu/~gpetris/

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.