[R] logistic regression on contingency table
Marc Schwartz
marc_schwartz at comcast.net
Mon Mar 5 21:00:36 CET 2007
On Mon, 2007-03-05 at 15:31 +0000, Dieter Menne wrote:
> Bingshan Li <bli1 <at> bcm.tmc.edu> writes:
>
> > I am wondering if there is a way in R to fit logistic regression on
> > contingency table. If I have original data, I can transform the data
> > into a design matrix and then call glm to fit the regression. But now
> > I have a 2x3 contingency table with first row for response 0 and
> > second row for response 1, and the columns are 3 levels of predictor
> > variable. The 3 levels are not ordinal though and indicator variables
> > would be more appropriate.
>
> From the documentation of glm():
>
> For binomial and quasibinomial families the response can also be specified
> as a factor (when the first level denotes failure and all others success)
> or as a two-column matrix with the columns giving the numbers of successes
> and failures.
> ....
>
> Dieter Menne
Just to expand on Dieter's comments: one way to take this approach is to
coerce the contingency table you are starting with to a data frame, and
then pass the cell counts to glm() via its 'weights' argument.
Taking some dummy example data in a 2D contingency table:
> TAB
X
Y A B C
0 1 9 2
1 3 3 2
So we have X (the predictor) and Y (the response).
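For anyone who wants to reproduce this, one way to build an equivalent
'TAB' (this construction is mine, not part of the original post) is:

```r
## Reconstruction of the dummy table shown above (my code, not from
## the original post), built from the raw cell counts.
TAB <- as.table(matrix(c(1, 9, 2,
                         3, 3, 2),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Y = c("0", "1"),
                                       X = c("A", "B", "C"))))
TAB
```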
Now, coerce TAB to a data frame. See ?as.data.frame.table and ?xtabs,
which reverses the process back to a contingency table:
DFT <- as.data.frame(TAB)
> DFT
Y X Freq
1 0 A 1
2 1 A 3
3 0 B 9
4 1 B 3
5 0 C 2
6 1 C 2
As an FYI, this gets us back to 'TAB':
> xtabs(Freq ~ ., DFT)
X
Y A B C
0 1 9 2
1 3 3 2
Now create the model, using 'Freq' for the case weights:
fit <- glm(Y ~ X, weights = Freq, data = DFT, family = binomial)
> summary(fit)
Call:
glm(formula = Y ~ X, family = binomial, data = DFT, weights = Freq)
Deviance Residuals:
1 2 3 4 5 6
-1.665 1.314 -2.276 2.884 -1.665 1.665
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.099 1.155 0.951 0.3414
XB -2.197 1.333 -1.648 0.0994 .
XC -1.099 1.528 -0.719 0.4720
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26.920 on 5 degrees of freedom
Residual deviance: 23.540 on 3 degrees of freedom
AIC: 29.54
Number of Fisher Scoring iterations: 4
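As an aside, the two-column matrix response form that Dieter quoted from
the glm() documentation can also be applied here. A sketch (the 'succ'
and 'fail' names are mine), using the same cell counts as 'TAB':

```r
## Sketch of the two-column (successes, failures) response form quoted
## from the glm() documentation. The names 'succ' and 'fail' are mine;
## the counts are the cells of 'TAB' above.
X <- factor(c("A", "B", "C"))
succ <- c(3, 3, 2)   # counts with Y == 1, by level of X
fail <- c(1, 9, 2)   # counts with Y == 0, by level of X
fit3 <- glm(cbind(succ, fail) ~ X, family = binomial)
coef(fit3)   # same estimates as 'fit': 1.099, -2.197, -1.099
```

The coefficient estimates and standard errors match the weighted fit;
only the residuals and deviance bookkeeping differ, since there are now
3 binomial observations rather than 6 weighted rows.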
An alternative to the above is to use a function that I just posted in
the past week for another query, called expand.dft():
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".")
{
  # Replicate each row of 'x' according to its Freq count
  DF <- sapply(1:nrow(x), function(i) x[rep(i, each = x$Freq[i]), ],
               simplify = FALSE)
  DF <- subset(do.call("rbind", DF), select = -Freq)

  # Restore appropriate column types, since the replication step
  # leaves everything as factors/characters
  for (i in 1:ncol(DF))
  {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }

  DF
}
This takes the data frame table 'DFT' from above and converts it back to
the raw observations:
DF <- expand.dft(DFT)
> DF
Y X
1 0 A
2 1 A
2.1 1 A
2.2 1 A
3 0 B
3.1 0 B
3.2 0 B
3.3 0 B
3.4 0 B
3.5 0 B
3.6 0 B
3.7 0 B
3.8 0 B
4 1 B
4.1 1 B
4.2 1 B
5 0 C
5.1 0 C
6 1 C
6.1 1 C
As an FYI, this gets us back to 'TAB':
> table(DF)
X
Y A B C
0 1 9 2
1 3 3 2
So, now we can use the normal approach for glm():
fit2 <- glm(Y ~ X, data = DF, family = binomial)
> summary(fit2)
Call:
glm(formula = Y ~ X, family = binomial, data = DF)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6651 -0.7585 -0.7585 0.8632 1.6651
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.099 1.155 0.951 0.3414
XB -2.197 1.333 -1.648 0.0994 .
XC -1.099 1.528 -0.719 0.4720
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 26.920 on 19 degrees of freedom
Residual deviance: 23.540 on 17 degrees of freedom
AIC: 29.54
Number of Fisher Scoring iterations: 4
Note of course that the degrees of freedom are different, though the
Null and Residual Deviances and the AIC are the same.
Taking the latter approach will, of course, enable you to make
subsequent manipulations on the raw data if you wish.
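For example (an illustration of such a manipulation, not from the
original post), with the raw observations in hand you can refit and
read an odds ratio straight off the coefficients, and check it against
the table counts:

```r
## An example of such a manipulation (my illustration, not from the
## original post): rebuild the raw observations, refit, and read off
## an odds ratio directly from the coefficients.
DF <- data.frame(
  Y = c(0, 1, 1, 1, rep(0, 9), rep(1, 3), 0, 0, 1, 1),
  X = factor(c(rep("A", 4), rep("B", 12), rep("C", 4)))
)
fit2 <- glm(Y ~ X, data = DF, family = binomial)
exp(coef(fit2)["XB"])   # odds(Y=1|B) / odds(Y=1|A) = (3/9) / (3/1) = 1/9
```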
HTH,
Marc Schwartz