# [R] logistic regression on contingency table

Marc Schwartz marc_schwartz at comcast.net
Mon Mar 5 21:00:36 CET 2007

```
On Mon, 2007-03-05 at 15:31 +0000, Dieter Menne wrote:
> Bingshan Li <bli1 <at> bcm.tmc.edu> writes:
>
> > I am wondering if there is a way in R to fit logistic regression on
> > contingency table. If I have original data, I can transform the data
> > into a design matrix and then call glm to fit the regression. But now
> > I have a 2x3 contingency table with first row for response 0 and
> > second row for response 1, and the columns are 3 levels of predictor
> > variable. The 3 levels are not ordinal though and indicator variables
> > would be more appropriate.
>
> From the documentation of glm():
>
> For binomial and quasibinomial families the response can also be specified
> as a factor (when the first level denotes failure and all others success)
> or as a two-column matrix with the columns giving the numbers of successes
> and failures.
> ....
>
> Dieter Menne
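
As an aside, the two-column (successes, failures) response form that the
documentation describes can be used directly. A minimal sketch, assuming
the same 2x3 counts as the example table below:

X <- factor(c("A", "B", "C"))
successes <- c(3, 3, 2)   # counts with Y = 1, by level of X
failures  <- c(1, 9, 2)   # counts with Y = 0, by level of X
fit0 <- glm(cbind(successes, failures) ~ X, family = binomial)
coef(fit0)                # same coefficients as the fits shown below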

Just to expand on Dieter's comments, one trick to taking this approach
is to coerce the contingency table you are starting with to a data
frame, and then specify a 'weights' argument to glm().

Taking some dummy example data in a 2D contingency table:

> TAB
   X
Y   A B C
  0 1 9 2
  1 3 3 2

So we have X (IV) and Y (Response).

Now, coerce TAB to a data frame. See ?as.data.frame.table and ?xtabs;
the latter reverses the process, back to a contingency table:

DFT <- as.data.frame(TAB)

> DFT
  Y X Freq
1 0 A    1
2 1 A    3
3 0 B    9
4 1 B    3
5 0 C    2
6 1 C    2

As an FYI, this gets us back to 'TAB':

> xtabs(Freq ~ ., DFT)
   X
Y   A B C
  0 1 9 2
  1 3 3 2

Now create the model, using 'Freq' for the case weights:

fit <- glm(Y ~ X, weights = Freq, data = DFT, family = binomial)

> summary(fit)

Call:
glm(formula = Y ~ X, family = binomial, data = DFT, weights = Freq)

Deviance Residuals:
     1       2       3       4       5       6
-1.665   1.314  -2.276   2.884  -1.665   1.665

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951   0.3414
XB            -2.197      1.333  -1.648   0.0994 .
XC            -1.099      1.528  -0.719   0.4720
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.920  on 5  degrees of freedom
Residual deviance: 23.540  on 3  degrees of freedom
AIC: 29.54

Number of Fisher Scoring iterations: 4

An alternative to the above is to use a function that I just posted in
the past week for another query, called expand.dft():

expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".")
{
  # Replicate each row of the frequency table 'x' Freq times
  DF <- sapply(1:nrow(x), function(i) x[rep(i, each = x$Freq[i]), ],
               simplify = FALSE)

  # Stack the replicated rows and drop the Freq column
  DF <- subset(do.call("rbind", DF), select = -Freq)

  # Re-convert each column to its natural type
  for (i in 1:ncol(DF))
  {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }

  DF
}

This takes the data frame table 'DFT' from above and converts it back to
the raw observations:

DF <- expand.dft(DFT)

> DF
Y X
1   0 A
2   1 A
2.1 1 A
2.2 1 A
3   0 B
3.1 0 B
3.2 0 B
3.3 0 B
3.4 0 B
3.5 0 B
3.6 0 B
3.7 0 B
3.8 0 B
4   1 B
4.1 1 B
4.2 1 B
5   0 C
5.1 0 C
6   1 C
6.1 1 C

As an FYI, this gets us back to 'TAB':

> table(DF)
   X
Y   A B C
  0 1 9 2
  1 3 3 2

So, now we can use the normal approach for glm():

fit2 <- glm(Y ~ X, data = DF, family = binomial)

> summary(fit2)

Call:
glm(formula = Y ~ X, family = binomial, data = DF)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6651  -0.7585  -0.7585   0.8632   1.6651

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951   0.3414
XB            -2.197      1.333  -1.648   0.0994 .
XC            -1.099      1.528  -0.719   0.4720
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.920  on 19  degrees of freedom
Residual deviance: 23.540  on 17  degrees of freedom
AIC: 29.54

Number of Fisher Scoring iterations: 4

Note of course that the degrees of freedom differ, though the Null and
Residual Deviances and the AIC are the same.

Taking the latter approach will, of course, enable you to make subsequent
manipulations on the raw data if you wish.

HTH,

Marc Schwartz

```