[R] Woolf's test, Odds ratio, stratification

Marc Schwartz marc_schwartz at comcast.net
Sat Feb 24 23:44:35 CET 2007


On Sat, 2007-02-24 at 10:30 -0800, francogrex wrote:
> Just a general question concerning the Woolf test (package vcd): when we
> have stratified data (2x2 tables) and the p-value of the Woolf test is
> below 0.05, do we assume that there is heterogeneity and that a common odds
> ratio cannot be computed?
> Does this mean that we have to try to add more stratification variables
> (stratify more) to make the Woolf test p-value insignificant?
> Also, in the same package there is an oddsratio function that seems to
> calculate the odds ratio even when the values are non-integers and even
> when one cell has a zero value (in contrast to the Mantel-Haenszel or the
> Fisher exact test, which do not admit zero values or non-integers). Does
> anyone know what the difference is, and how to obtain a p-value on that
> odds ratio?
> Thanks

I'll defer comments on your 'vcd' function specific questions to the
package authors.

However, from a general perspective, the Woolf test, or for that matter
the Breslow-Day test (with or without the Tarone correction), is akin to
testing for the homogeneity of variances prior to performing an ANOVA:
you are checking whether the underlying assumptions of the main analysis
have been violated.
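As a concrete illustration of that analogy (an aside of my own, using a
built-in dataset rather than anything from the original question), the
ANOVA-side check would look something like:

```r
# Homogeneity-of-variance check before an ANOVA on group means;
# a small p-value here cautions against taking aov() results at
# face value, just as a significant Woolf test cautions against
# reading too much into a CMH common odds ratio
bartlett.test(count ~ spray, data = InsectSprays)
```
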

If the tests are "significant", it is not an _absolute_ contraindication
to continuing, but the result should give you pause to consider what
else is going on in your data. 

These tests are commonly applied, for example, in multi-center clinical
trials to test for the 'poolability' of the data from the sites into a
common data set for analysis.

In the case of the ANOVA, having markedly different variances suggests
that the differences in the means may not be the most 'interesting' of
findings in your data.

Similarly, with stratified 2x2 tables, you can perform a CMH test using
mantelhaen.test() and you will get a common odds ratio and its CI.
However, the common odds ratio may be, at best, less interesting than
other underlying characteristics of the data and, at worst, misleading.

As an example, let's use the UCBAdmissions dataset and some of the
examples in ?mantelhaen.test. For specific information on the dataset,
see ?UCBAdmissions.  In fact, reading the Details on that help page and
running the examples, especially the series of 6 mosaicplot()s, should
provide additional insight into the findings I present below.

So, let's start:

First, let's charge ahead and run the CMH test on the data:

> mantelhaen.test(UCBAdmissions)

	Mantel-Haenszel chi-squared test with continuity correction

data:  UCBAdmissions 
Mantel-Haenszel X-squared = 1.4269, df = 1, p-value = 0.2323
alternative hypothesis: true common odds ratio is not equal to 1 
95 percent confidence interval:
 0.7719074 1.0603298 
sample estimates:
common odds ratio 
        0.9046968 


So, we can see that the resultant test statistic is not significant, yet
we do get a common OR and CI's.


Now, let's do the Woolf test, using the code provided
in ?mantelhaen.test:
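For completeness, here is that helper as I recall it from the Examples
section of ?mantelhaen.test (reproduced so the call below is
self-contained); it adds a 1/2 continuity correction to each cell and
tests the weighted dispersion of the per-stratum log odds ratios against
a chi-squared distribution:

```r
# Woolf test for homogeneity of odds ratios across the strata of a
# 2 x 2 x k table; returns the p-value
woolf <- function(x) {
  x <- x + 1 / 2                     # continuity correction
  k <- dim(x)[3]                     # number of strata
  or <- apply(x, 3, function(x) (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1]))
  w  <- apply(x, 3, function(x) 1 / sum(1 / x))   # inverse-variance weights
  1 - pchisq(sum(w * (log(or) - weighted.mean(log(or), w))^2), k - 1)
}
```
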

> woolf(UCBAdmissions)
[1] 0.003427200


This suggests that the assumption of a common odds ratio in the CMH test
is violated, and indeed, if we run:

> apply(UCBAdmissions, 3, function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]))
        A         B         C         D         E         F 
0.3492120 0.8025007 1.1330596 0.9212838 1.2216312 0.8278727 


We can see that the OR's range from 0.3492120 to 1.2216312.
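As a quick illustrative check (my own addition, not from the help page),
approximate 95% CIs for each stratum's OR, using the usual log-OR
standard error, show that Dept A is the stratum that stands apart, with
an interval entirely below 1:

```r
# Approximate 95% CI for each department's odds ratio, based on the
# standard error of the log odds ratio: sqrt(sum of reciprocal counts)
or_ci <- apply(UCBAdmissions, 3, function(x) {
  or <- (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])
  se <- sqrt(sum(1 / x))
  c(OR    = or,
    lower = exp(log(or) - 1.96 * se),
    upper = exp(log(or) + 1.96 * se))
})
round(t(or_ci), 3)
```
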


So, let's take a look at the 'raw' data and do some further testing.  A
while ago, I wrote (and believe posted) code to take the output of
as.data.frame.table() and convert it back to its raw structure as a
non-summarized data frame. Let's do that with UCBAdmissions, since
UCBAdmissions is a summarized 3D table:

> str(UCBAdmissions)
 table [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
 - attr(*, "dimnames")=List of 3
  ..$ Admit : chr [1:2] "Admitted" "Rejected"
  ..$ Gender: chr [1:2] "Male" "Female"
  ..$ Dept  : chr [1:6] "A" "B" "C" "D" ...


Here is the code:

expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".")
{
  # Replicate each row of the summarized data frame 'Freq' times
  DF <- sapply(1:nrow(x), function(i) x[rep(i, each = x$Freq[i]), ],
               simplify = FALSE)

  # Bind the replicated pieces together and drop the Freq column
  DF <- subset(do.call("rbind", DF), select = -Freq)

  # Re-convert each column to its natural type
  for (i in 1:ncol(DF))
  {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }

  DF
}



So:

DF <- expand.dft(as.data.frame.table(UCBAdmissions))

> str(DF)
'data.frame':	4526 obs. of  3 variables:
 $ Admit : Factor w/ 2 levels "Admitted","Rejected": 1 1 1 1 1 1 1 1 1 1 ...
 $ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Dept  : Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
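As a sanity check (my own addition), re-tabulating the expanded data
frame should reproduce the original counts exactly. The snippet below
rebuilds the expanded frame with a compact base R equivalent of
expand.dft() so that it runs standalone; if you already have DF from
above, start at the xtabs() line:

```r
# Compact equivalent of expand.dft(): replicate each row 'Freq' times
x  <- as.data.frame(UCBAdmissions)
DF <- x[rep(seq_len(nrow(x)), x$Freq), c("Admit", "Gender", "Dept")]

# Re-tabulate and compare, indexing by the original dimnames so the
# check is insensitive to factor level ordering
tab <- xtabs(~ Admit + Gender + Dept, data = DF)
dn  <- dimnames(UCBAdmissions)
all(tab[dn$Admit, dn$Gender, dn$Dept] == UCBAdmissions)   # TRUE
```
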


Now, let's create logistic regression models on the raw data, first
using just the two covariates of Gender and Dept:

fit <- glm(Admit ~ Gender + Dept, family = binomial, data = DF)

> summary(fit)

Call:
glm(formula = Admit ~ Gender + Dept, family = binomial, data = DF)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3613  -0.9588   0.3741   0.9306   1.4773  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.68192    0.09911  -6.880 5.97e-12 ***
GenderMale   0.09987    0.08085   1.235    0.217    
DeptB        0.04340    0.10984   0.395    0.693    
DeptC        1.26260    0.10663  11.841  < 2e-16 ***
DeptD        1.29461    0.10582  12.234  < 2e-16 ***
DeptE        1.73931    0.12611  13.792  < 2e-16 ***
DeptF        3.30648    0.16998  19.452  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6044.3  on 4525  degrees of freedom
Residual deviance: 5187.5  on 4519  degrees of freedom
AIC: 5201.5

Number of Fisher Scoring iterations: 5
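An aside of my own: you can get the same fit without expanding the table
at all, by using the Freq column of the summarized data frame as case
weights, and exponentiating the coefficients puts them on the odds-ratio
scale. Note also how the adjusted Gender OR, flipped back to the 'odds
of admission' scale, lands close to the CMH common odds ratio above:

```r
# Equivalent fit on the summarized table via case weights (no row
# expansion); glm() models the log-odds of the second Admit level,
# which here is "Rejected"
fit2 <- glm(Admit ~ Gender + Dept, family = binomial,
            weights = Freq, data = as.data.frame(UCBAdmissions))

# Odds ratios (of rejection) with Wald 95% CIs
round(exp(cbind(OR = coef(fit2), confint.default(fit2))), 3)

# The adjusted Gender effect, inverted to the 'odds of admission'
# scale, is close to the CMH common odds ratio (0.9047)
round(1 / exp(coef(fit2)[["GenderMale"]]), 4)
```
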




Now, for comparison, let's create the same model, but this time
including interaction terms for Gender*Dept:

fitIT <- glm(Admit ~ Gender * Dept, family = binomial, data = DF)

> summary(fitIT)

Call:
glm(formula = Admit ~ Gender * Dept, family = binomial, data = DF)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3793  -0.9768   0.3820   0.9127   1.8642  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -1.5442     0.2527  -6.110 9.94e-10 ***
GenderMale         1.0521     0.2627   4.005 6.21e-05 ***
DeptB              0.7904     0.4977   1.588  0.11224    
DeptC              2.2046     0.2672   8.252  < 2e-16 ***
DeptD              2.1662     0.2750   7.878 3.32e-15 ***
DeptE              2.7013     0.2790   9.682  < 2e-16 ***
DeptF              4.1250     0.3297  12.512  < 2e-16 ***
GenderMale:DeptB  -0.8321     0.5104  -1.630  0.10306    
GenderMale:DeptC  -1.1770     0.2996  -3.929 8.53e-05 ***
GenderMale:DeptD  -0.9701     0.3026  -3.206  0.00135 ** 
GenderMale:DeptE  -1.2523     0.3303  -3.791  0.00015 ***
GenderMale:DeptF  -0.8632     0.4027  -2.144  0.03206 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6044.3  on 4525  degrees of freedom
Residual deviance: 5167.3  on 4514  degrees of freedom
AIC: 5191.3

Number of Fisher Scoring iterations: 5


Two key things to notice:

1. Note that the coefficients of the main effects of Gender and Dept
change materially from the first model to the second, demonstrating the
impact of the interaction terms.

2. The interaction terms themselves are generally quite significant.
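To make point 1 concrete (again my own addition): because Gender * Dept
saturates the Gender-by-Dept margin, the interaction model reproduces
each department's own odds ratio exactly, and you can recover them from
the coefficients. The snippet refits the model on the summarized table
via case weights so that it runs standalone:

```r
# Refit the interaction model on the summarized counts (equivalent to
# fitIT above); glm() models the log-odds of "Rejected"
fitIT2 <- glm(Admit ~ Gender * Dept, family = binomial,
              weights = Freq, data = as.data.frame(UCBAdmissions))

# Department-specific Gender effects: GenderMale plus its interaction
# with each Dept, inverted back to the 'odds of admission' scale
cf <- coef(fitIT2)
or_adm <- 1 / exp(cf[["GenderMale"]] +
                  c(0, cf[grep("^GenderMale:", names(cf))]))
names(or_adm) <- LETTERS[1:6]
round(or_adm, 4)
# matches the per-stratum ORs from the apply() call earlier:
# 0.3492 0.8025 1.1331 0.9213 1.2216 0.8279
```
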



Further, running:

> anova(fit, fitIT,  test = "Chisq")
Analysis of Deviance Table

Model 1: Admit ~ Gender + Dept
Model 2: Admit ~ Gender * Dept
  Resid. Df Resid. Dev   Df Deviance P(>|Chi|)
1      4519     5187.5                        
2      4514     5167.3    5     20.2  0.001144


pretty well confirms that the impact of including the interaction terms
in the model is quite significant and that there is confounding between
Gender and Department, which is arguably more interesting than the
isolated impact of Gender on Admissions.

You might want to look further at the help page for the dataset as I
noted above, to gain further graphical insight into the data.


To your query, the goal is not to 'make' the Woolf test non-significant,
but to better understand why it is significant to start with and dig
deeper into your data to gain further insight.

HTH,

Marc Schwartz


