# [R] logistic regression on contingency table

Marc Schwartz marc_schwartz at comcast.net
Mon Mar 5 21:00:36 CET 2007

```
On Mon, 2007-03-05 at 15:31 +0000, Dieter Menne wrote:
> Bingshan Li <bli1 <at> bcm.tmc.edu> writes:
>
> > I am wondering if there is a way in R to fit logistic regression on
> > contingency table. If I have original data, I can transform the data
> > into a design matrix and then call glm to fit the regression. But now
> > I have a 2x3 contingency table with first row for response 0 and
> > second row for response 1, and the columns are 3 levels of predictor
> > variable. The 3 levels are not ordinal though and indicator variables
> > would be more appropriate.
>
> From the documentation of glm():
>
> For binomial and quasibinomial families the response can also be specified
> as a factor (when the first level denotes failure and all others success)
> or as a two-column matrix with the columns giving the numbers of successes
> and failures.
> ....
>
> Dieter Menne
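
As an aside, the two-column (successes, failures) response form that the
documentation describes can be used directly. A minimal sketch, assuming
the same 2x3 counts as the example table below:

X <- factor(c("A", "B", "C"))
successes <- c(3, 3, 2)   # counts with Y = 1, by level of X
failures  <- c(1, 9, 2)   # counts with Y = 0, by level of X
fit0 <- glm(cbind(successes, failures) ~ X, family = binomial)
coef(fit0)                # same coefficients as the fits shown below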

Just to expand on Dieter's comments, one trick to taking this approach
is to coerce the contingency table you are starting with to a data
frame, and then specify a 'weights' argument to glm().

Taking some dummy example data in a 2D contingency table:

> TAB
   X
Y   A B C
  0 1 9 2
  1 3 3 2

So we have X (IV) and Y (Response).

Now, coerce TAB to a data frame. See ?as.data.frame.table and ?xtabs;
the latter reverses the process, back to a contingency table:

DFT <- as.data.frame(TAB)

> DFT
  Y X Freq
1 0 A    1
2 1 A    3
3 0 B    9
4 1 B    3
5 0 C    2
6 1 C    2

As an FYI, this gets us back to 'TAB':

> xtabs(Freq ~ ., DFT)
   X
Y   A B C
  0 1 9 2
  1 3 3 2

Now create the model, using 'Freq' for the case weights:

fit <- glm(Y ~ X, weights = Freq, data = DFT, family = binomial)

> summary(fit)

Call:
glm(formula = Y ~ X, family = binomial, data = DFT, weights = Freq)

Deviance Residuals:
     1       2       3       4       5       6
-1.665   1.314  -2.276   2.884  -1.665   1.665

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951   0.3414
XB            -2.197      1.333  -1.648   0.0994 .
XC            -1.099      1.528  -0.719   0.4720
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.920  on 5  degrees of freedom
Residual deviance: 23.540  on 3  degrees of freedom
AIC: 29.54

Number of Fisher Scoring iterations: 4

An alternative to the above is to use a function that I just posted in
the past week for another query, called expand.dft():

expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".")
{
  # Replicate each row of the frequency table 'x' Freq times
  DF <- sapply(1:nrow(x), function(i) x[rep(i, each = x$Freq[i]), ],
               simplify = FALSE)

  # Stack the replicated rows and drop the Freq column
  DF <- subset(do.call("rbind", DF), select = -Freq)

  # Re-convert each column to its natural type
  for (i in 1:ncol(DF))
  {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }

  DF
}

This takes the data frame table 'DFT' from above and converts it back to
the raw observations:

DF <- expand.dft(DFT)

> DF
Y X
1   0 A
2   1 A
2.1 1 A
2.2 1 A
3   0 B
3.1 0 B
3.2 0 B
3.3 0 B
3.4 0 B
3.5 0 B
3.6 0 B
3.7 0 B
3.8 0 B
4   1 B
4.1 1 B
4.2 1 B
5   0 C
5.1 0 C
6   1 C
6.1 1 C

As an FYI, this gets us back to 'TAB':

> table(DF)
   X
Y   A B C
  0 1 9 2
  1 3 3 2

So, now we can use the normal approach for glm():

fit2 <- glm(Y ~ X, data = DF, family = binomial)

> summary(fit2)

Call:
glm(formula = Y ~ X, family = binomial, data = DF)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6651  -0.7585  -0.7585   0.8632   1.6651

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951   0.3414
XB            -2.197      1.333  -1.648   0.0994 .
XC            -1.099      1.528  -0.719   0.4720
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.920  on 19  degrees of freedom
Residual deviance: 23.540  on 17  degrees of freedom
AIC: 29.54

Number of Fisher Scoring iterations: 4

Note of course that the degrees of freedom differ, though the Null and
Residual Deviances and the AIC are the same.

Taking the latter approach will, of course, enable you to make subsequent
manipulations on the raw data if you wish.

HTH,

Marc Schwartz

```