Measures of Assocation using

A common task in epidemiology is to quantify the strength of association between exposures (‘risk factors’) and disease outcomes. In this context the term ‘exposure’ is taken to mean a variable whose association with the outcome is to be estimated.

Exposures can be harmful, beneficial or both harmful and beneficial (e.g., if an immunisable disease is circulating, exposure to immunising agents helps most recipients but may harm those who experience adverse reactions). The term ‘outcome’ is used to describe all the possible results that may arise from exposure to a causal factor or from preventive or therapeutic interventions (Porta, Greenland, and Last 2014). In human and animal health an ‘outcome-positive’ individual is an individual who has experienced a given disease of interest.

In this vignette we outline describe how epiR can be used to compute the various measures of association used in epidemiology notably the risk ratio, odds ratio, attributable risk in the exposed, attributable fraction in the exposed, attributable risk in the population and attributable fraction in the population. Examples are provided to demonstrate how the package can be used to deal with exposure-outcome data presented in various formats.

This vignette has been written assuming the reader routinely formats their 2 $\times$ 2 table data with the outcome status as columns and exposure status as rows. If this is not the case the argument outcome = "as.columns" (the default) can be changed to outcome = "as.rows".

The EpiToolbox app for iPhone and Android devices provides access to many of the measures of association functions in epiR using a smart phone.

Measures of association strength

The incidence risk ratio

Consider a study where study subjects are monitored for the presence of disease over a defined follow-up period. At the start of the follow-up period study subjects are classified according to exposure to a hypothesised risk factor. If the exposure and outcome are binary variables (yes or no) we can present the counts of subjects in each of the four exposure-outcome categories in a 2 $\times$ 2 table.

When our data are in this format we can calculate the incidence risk of the outcome in those that were exposed $R_{[D+|E+]}$, the incidence risk in those that were not exposed $R_{[D+|E-]}$ and finally, the incidence risk in both the exposed and the unexposed (i.e., the entire study population) $R_{[D+|E\pm]}$:

Table 1: A 2 × 2 table with incidence risks calculated for the exposure positive, the exposure negative and the entire study population.
	Outcome+	Outcome-	Total	Risk
Exposure+	a	b	a + b	R[D+\|E+] = a ÷ (a + b)
Exposure-	c	c	c + d	R[D+\|E-] = c ÷ (c + d)
Total	a + c	b + c	a + b + c + d	R[D+\|E±] = (a + c) ÷ (a + b + c + d)

In Figure 1 we present the outcome incidence risk in the exposed, unexposed and exposed plus unexposed groups as a bar chart. The outcome incidence risk ratio equals the outcome incidence risk in the exposed divided by the outcome incidence risk in the unexposed:

Figure 1: The incidence risk ratio.

The outcome incidence risk ratio provides an estimate of how many times more likely exposed individuals are to experience the outcome of interest, compared with non-exposed individuals.

If the outcome incidence risk ratio equals 1, then the risk of the outcome in both the exposed and non-exposed groups are the same. If the outcome incidence risk ratio is greater than 1, then exposure increases the outcome incidence risk with greater departures from 1 indicative of a stronger association If the outcome incidence risk ratio is less than 1, exposure reduces the outcome incidence risk and exposure is said to be protective.

Tip: The glossary in the EpiToolbox app contains an animated version of Figure 1.

The odds ratio — cohort studies

In a cohort study definition of exposure status (exposure-positive, exposure-negative) comes first. Subjects are then followed over time to determine their outcome status (outcome-positive, outcome-negative). The outcome odds in the exposed and unexposed populations are calculated as follows:

Table 2: A 2 × 2 table with incidence odds calculated for the exposure positive, the exposure negative and the entire study population.
	Outcome+	Outcome-	Total	Odds
Exposure+	a	b	a + b	O[D+\|E+] = a ÷ b
Exposure-	c	d	c + d	O[D+\|E-] = c ÷ d
Total	a + c	b + d	a + b + c + d	O[D+\|E±] = (a + c) ÷ (b + d)

The outcome odds ratio for a cohort study is the outcome odds in the exposed divided by the outcome odds in the unexposed.

The odds ratio — case-control studies

In a case-control study outcome status (outcome-positive, outcome-negative) is defined first. The history provided by each study subject then provides information about exposure status (exposure-positive, exposure-negative). For case-control studies, instead of talking about the outcome odds in the exposed and unexposed groups (as we did when we were working with data from a cohort study) we talk about the odds of exposure odds in the case (outcome-positive) and control (outcome-negative) groups. To emphasise this difference we list outcome status (case, control) as rows and exposure status (exposure-positive, exposure-negative) in our 2 $\times$ 2 table. An effective way to remember this is to think about the temporal order of events in a study. In a case-control study outcome status is defined first: exposure status is defined second. Because outcome status comes first it is listed as the first column of the 2 $\times$ 2 table.

Table 3: A 2 × 2 table with incidence odds calculated for the outcome positive, the outcome negative and the entire study population.
	Exposure+	Exposure-	Total	Odds
Outcome+	a	b	a + b	O[E+\|D+] = a ÷ b
Outcome-	c	d	c + d	O[E+\|D-] = c ÷ d
Total	a + c	b + d	a + b + c + d	O[E+\|D±] = (a + c) ÷ (b + d)

The odds ratio for a case-control study is defined as exposure odds in the cases $O_{[E+|D+]}$ divided by the exposure odds in the controls $O_{[E+|D-]}$.

Note that the numeric estimate of the odds ratio for a case-contrl study is exactly the same as that calculated for a cohort study. The expression of the result is the only thing that differs. Key points:

In a cohort study we talk about the outcome odds being $x$ times greater (or less) in the exposed, compared with the unexposed.
In a case-control study we talk about the exposure odds being $x$ times greater (or less) in cases, compared with controls.

Measures of effect in the exposed

The attributable risk in the exposed

The attributable risk is defined as the increase or decrease in the incidence risk of the outcome in the exposed that is attributable to exposure (Figure 2). The attributable risk (unlike the incidence risk ratio) provides a measure of the absolute frequency of the outcome that is associated with exposure.

Figure 2: The attributable risk in the exposed.

A useful way of expressing attributable risk in a clinical setting is in terms of the number needed to treat, NNT. NNT equals the inverse of the attributable risk. Depending on the outcome of interest we use different labels for NNT.

When dealing with an outcome that is ‘desirable’ (e.g., treatment success) we call NNT the number needed to treat for benefit, NNTB. NNTB equals the number of subjects who would have to be exposed to result in a single (desirable) outcome. When dealing with an outcome that is ‘undesirable’ (e.g., death) we call NNT the number needed to treat for harm, NNTH. NNTH equals the number of subjects who would have to be exposed to result in a single (undesirable) outcome.

The attributable fraction in the exposed

The attributable fraction in the exposed is the proportion of outcome-positive subjects in the exposed group that is due to exposure (Figure 3).

$Figure 3: The attributable fraction in the exposed.$

Figure 3: The attributable fraction in the exposed.

Measures of effect in the population

The attributable risk in the population

The population attributable risk is the increase or decrease in incidence risk of the outcome in the study population that is attributable to exposure (Figure 4).

Figure 4: The attributable risk in the population.

The attributable fraction in the population

The population attributable fraction (also known as the aetiologic fraction) is the proportion of outcome-positive subjects in the study population that is due to the exposure (Figure 5).

$Figure 5: The attributable fraction in the population.$

Figure 5: The attributable fraction in the population.

On the condition that the exposure of interest is a cause of the disease outcome, the population attributable fraction represents the proportional reduction in average disease risk over a specified period of time that would be achieved by eliminating the exposure while the distribution of other risk factors in the population remained unchanged. For this reason, PAFs are particularly useful to guide policy makers when planning public health interventions.

In summary, if you’re going to use PAFs as a means for informing policy, make sure that: (1) the exposure of interest is causally related to the outcome; and (2) the exposure of interest is something amenable to intervention.

Theory to practice: Calculating measures of association using R

Direct entry of 2 by 2 table contingency table cell frequencies

Firstly, a 2 $\times$ 2 table can be created by listing the contingency table cell frequencies in vector format. Take the following example.

A cohort study investigating the relationship between dry cat food (DCF) and presence or absence of feline lower urinary tract disease (FLUTD) was conducted (Willeberg 1977). Counts of individuals in each group were as follows. DCF-exposed cats (outcome-positive, outcome-negative) 13, 2163. Non DCF-exposed cats (outcome-positive, outcome-negative) 5, 3349. We can enter these data directly into R as a vector of length four. Check that your counts have been entered in the correct order by viewing the data as a matrix.

dat.v01 <- c(13,2163,5,3349); dat.v01
#> [1]   13 2163    5 3349

# View the data in the usual 2 by 2 table format:
matrix(dat.v01, nrow = 2, byrow = TRUE)
#>      [,1] [,2]
#> [1,]   13 2163
#> [2,]    5 3349

Calculate the outcome incidence risk ratio, incidence odds ratio, attributable prevalence in the exposed, attributable fraction in the exposed, attributable prevalence in the population and the attributable fraction in the population using function epi.2by2. Note that we use the terms incidence risk ratio and incidence odds ratio here because we’re dealing with data from a cohort study — following determination of dry food exposure status cats have been followed over time to determine the incidence risk of FLUTD.

library(epiR)

epi.2by2(dat = dat.v01, method = "cohort.count", elab = "Dry food", olab = "FLUTD", digits = 2, conf.level = 0.95, units = 100, interpret = FALSE, outcome = "as.columns")
#>              FLUTD+    FLUTD-      Total                 Inc risk *
#> Dry food+        13      2163       2176        0.60 (0.32 to 1.02)
#> Dry food-         5      3349       3354        0.15 (0.05 to 0.35)
#> Total            18      5512       5530        0.33 (0.19 to 0.51)
#> 
#> Point estimates and 95% CIs:
#> -------------------------------------------------------------------
#> Inc risk ratio                                 4.01 (1.43, 11.23)
#> Inc odds ratio                                 4.03 (1.43, 11.31)
#> Attrib risk in the exposed *                   0.45 (0.10, 0.80)
#> Attrib fraction in the exposed (%)            75.05 (32.91, 90.72)
#> Attrib risk in the population *                0.18 (-0.02, 0.38)
#> Attrib fraction in the population (%)         54.20 (32.38, 74.91)
#> -------------------------------------------------------------------
#> Uncorrected chi2 test that OR = 1: chi2(1) = 8.177 Pr>chi2 = 0.004
#> Fisher exact test that OR = 1: Pr>chi2 = 0.006
#>  Wald confidence limits
#>  CI: confidence interval
#>  * Outcomes per 100 population units

The incidence risk of FLUTD among those that were dry food positive was 4.0 (95% CI 1.4 to 11) times the incidence risk of FLUTD among those that were dry food negative.

In DCF exposed cats, 75% of FLUTD was attributable to DCF (95% CI 33% to 91%). Fifty-four percent of FLUTD cases in this cat population were attributable to DCF (95% CI 32% to 75%).

Need a hand to get the correct wording to explain each of the listed measures of association and measures of effect? Set interpret = TRUE in epi.2by2:

epi.2by2(dat = dat.v01, method = "cohort.count", elab = "Dry food", olab = "FLUTD", digits = 2, conf.level = 0.95, units = 100, interpret = TRUE, outcome = "as.columns")
#>              FLUTD+    FLUTD-      Total                 Inc risk *
#> Dry food+        13      2163       2176        0.60 (0.32 to 1.02)
#> Dry food-         5      3349       3354        0.15 (0.05 to 0.35)
#> Total            18      5512       5530        0.33 (0.19 to 0.51)
#> 
#> Point estimates and 95% CIs:
#> -------------------------------------------------------------------
#> Inc risk ratio                                 4.01 (1.43, 11.23)
#> Inc odds ratio                                 4.03 (1.43, 11.31)
#> Attrib risk in the exposed *                   0.45 (0.10, 0.80)
#> Attrib fraction in the exposed (%)            75.05 (32.91, 90.72)
#> Attrib risk in the population *                0.18 (-0.02, 0.38)
#> Attrib fraction in the population (%)         54.20 (32.38, 74.91)
#> -------------------------------------------------------------------
#> Uncorrected chi2 test that OR = 1: chi2(1) = 8.177 Pr>chi2 = 0.004
#> Fisher exact test that OR = 1: Pr>chi2 = 0.006
#>  Wald confidence limits
#>  CI: confidence interval
#>  * Outcomes per 100 population units 
#> 
#>  Measures of association strength:
#>  The incidence risk of FLUTD among those that were dry food positive was 4.01 (95% CI 1.43 to 11.23) times the incidence risk of FLUTD among those that were dry food negative. The incidence odds of FLUTD among those that were dry food positive was 4.03 (95% CI 1.43 to 11.31) times the incidence odds of FLUTD among those that were dry food negative. 
#> 
#>  Measures of effect in the exposed:
#>  Exposure to dry food changed the incidence risk of FLUTD among those that were dry food positive by +0.45 (95% CI +0.1 to +0.8) per 100 population units. Among those that were dry food positive 75% (95% CI 33% to 91%) of FLUTD cases were attributable to dry food. 
#> 
#>  Number needed to treat:
#>  Exposure to dry food changed the incidence risk of FLUTD among those that were dry food positive by +0.45 (95% CI +0.1 to +0.8) per 100 population units. The number needed to expose to dry food to increase FLUTD frequency by one was 223 (95% CI 125 to 1008). 
#> 
#>  Measures of effect in the population:
#>  Exposure to dry food changed the incidence risk of FLUTD among those that were dry food positive and dry food negative by +0.18 (95% CI -0.02 to +0.38) per 100 population units. Among those that were dry food positive and dry food negative 54% (95% CI 32% to 75%) of FLUTD cases were attributable to dry food.

Data frame with one row per observation

Here we provide examples where you have exposure status and outcome status listed for each member of your study population. There are two options for contingency table preparation in this situation: (1) using base R’s table function; or (2) using functions available in the tidyverse group of packages.

For this example we use the low infant birth weight data presented by Hosmer and Lemeshow (2000) and available in the MASS package in R. The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Massachusetts USA during 1986.

Two by two table preparation using the `table` function in base R

library(MASS)

# Load and view the data:
dat.df02 <- birthwt; head(dat.df02)
#>    low age lwt race smoke ptl ht ui ftv  bwt
#> 85   0  19 182    2     0   0  0  1   0 2523
#> 86   0  33 155    3     0   0  0  0   3 2551
#> 87   0  20 105    1     1   0  0  0   1 2557
#> 88   0  21 108    1     1   0  0  1   2 2594
#> 89   0  18 107    1     1   0  0  1   0 2600
#> 91   0  21 124    3     0   0  0  0   0 2622

Each row of this data set provides details of a child born to a single mother. We’re interested in the association between smoke (the mother’s smoking status during pregnancy) and low (delivery of a baby less than 2.5 kg bodyweight).

Its important that the table you present to epi.2by2 is in the correct format: Outcome positives in the first column, outcome negatives in the second column, exposure positives in the first row and exposure negatives in the second row. If we run the table function on the bwt data the output table is in the incorrect format. Outcome negatives are listed first in the columns of the table and exposure negatives are listed first in the rows:

dat.tab02 <- table(dat.df02$smoke, dat.df02$low, dnn = c("Smoke", "Low BW")); dat.tab02
#>      Low BW
#> Smoke  0  1
#>     0 86 29
#>     1 44 30

There are two ways to fix this problem. The quick fix is to simply ask R to switch the order of the rows and columns in the output table:

dat.tab02 <- table(dat.df02$smoke, dat.df02$low, dnn = c("Smoke", "Low BW")); dat.tab02
#>      Low BW
#> Smoke  0  1
#>     0 86 29
#>     1 44 30
dat.tab02 <- dat.tab02[2:1,2:1]; dat.tab02
#>      Low BW
#> Smoke  1  0
#>     1 30 44
#>     0 29 86

The second approach is to set the exposure variable and the outcome variable as a factor and to define the levels of each factor using levels = c(1,0):

# Variables low, smoke and race as factors. Put an 'f' in front of the variable names to remind you that they're factors:

dat.df02$flow <- factor(dat.df02$low, levels = c(1,0))
dat.df02$fsmoke <- factor(dat.df02$smoke, levels = c(1,0))
dat.df02$frace <- factor(dat.df02$race, levels = c(1,2,3))

dat.tab02 <- table(dat.df02$fsmoke, dat.df02$flow, dnn = c("Smoke", "Low BW")); dat.tab02
#>      Low BW
#> Smoke  1  0
#>     1 30 44
#>     0 29 86

With the table in the correct format we can compute the odds ratio for smoking and delivery of a low birth weight baby:

dat.epi02 <- epi.2by2(dat = dat.tab02, method = "cohort.count", elab = "Smoke", olab = "Low BW", digits = 2, conf.level = 0.95, units = 100, interpret = FALSE, outcome = "as.columns")
dat.epi02
#>           Low BW+    Low BW-      Total                 Inc risk *
#> Smoke+         30         44         74     40.54 (29.27 to 52.59)
#> Smoke-         29         86        115     25.22 (17.58 to 34.17)
#> Total          59        130        189     31.22 (24.69 to 38.35)
#> 
#> Point estimates and 95% CIs:
#> -------------------------------------------------------------------
#> Inc risk ratio                                 1.61 (1.06, 2.44)
#> Inc odds ratio                                 2.02 (1.08, 3.78)
#> Attrib risk in the exposed *                   15.32 (1.61, 29.04)
#> Attrib fraction in the exposed (%)            37.80 (5.38, 58.95)
#> Attrib risk in the population *                6.00 (-4.33, 16.33)
#> Attrib fraction in the population (%)         19.22 (10.89, 28.78)
#> -------------------------------------------------------------------
#> Uncorrected chi2 test that OR = 1: chi2(1) = 4.924 Pr>chi2 = 0.026
#> Fisher exact test that OR = 1: Pr>chi2 = 0.036
#>  Wald confidence limits
#>  CI: confidence interval
#>  * Outcomes per 100 population units

The incidence odds of low BW among those mothers that were smoke positive was 2.0 (95% CI 1.1 to 3.8) times the incidence odds of low BW among those mothers that were smoke negative. All of the calculated measures of association computed by epi.2by2 can be listed using:

names(dat.epi02$massoc.detail)
#>  [1] "RR.strata.wald"      "RR.strata.taylor"    "RR.strata.score"    
#>  [4] "RR.strata.koopman"   "OR.strata.wald"      "OR.strata.cfield"   
#>  [7] "OR.strata.score"     "OR.strata.mle"       "ARisk.strata.wald"  
#> [10] "ARisk.strata.score"  "NNT.strata.wald"     "NNT.strata.score"   
#> [13] "AFRisk.strata.wald"  "PARisk.strata.wald"  "PARisk.strata.piri" 
#> [16] "PAFRisk.strata.wald" "wald.strata.rr"      "wald.strata.or"     
#> [19] "chi2.strata.uncor"   "chi2.strata.yates"   "chi2.strata.fisher" 
#> [22] "chi2.correction"

Compare the Wald confidence intervals for the odds ratio with the score confidence intervals:

dat.epi02$massoc.detail$OR.strata.wald
#>        est   lower    upper
#> 1 2.021944 1.08066 3.783112
# Wald confidence intervals: 2.0 (95% CI 1.1 to 3.8)

dat.epi02$massoc.detail$OR.strata.score
#>        est    lower    upper
#> 1 2.021944 1.084385 3.773885
# Score confidence intervals: 2.0 (95% CI 1.1 to 3.8)

Two by two table preparation using `tidyverse`

The dplyr and tidyr packages (available within the tidyverse group of packages) can also be used to prepare your data in the required format:

library(dplyr); library(tidyr)

dat.df03 <- birthwt; head(dat.df03)
#>    low age lwt race smoke ptl ht ui ftv  bwt
#> 85   0  19 182    2     0   0  0  1   0 2523
#> 86   0  33 155    3     0   0  0  0   3 2551
#> 87   0  20 105    1     1   0  0  0   1 2557
#> 88   0  21 108    1     1   0  0  1   2 2594
#> 89   0  18 107    1     1   0  0  1   0 2600
#> 91   0  21 124    3     0   0  0  0   0 2622

# Here we set the factor levels and tabulate the data in a single call using pipe operators:
dat.tab03 <- dat.df03 %>%
  mutate(flow = factor(low, levels = c(1,0), labels = c("yes","no"))) %>%
  mutate(fsmoke = factor(smoke, levels = c(1,0), labels = c("yes","no"))) %>%
  group_by(fsmoke, flow) %>%
  summarise(n = n()) 

# View the data:
dat.tab03
#> # A tibble: 4 × 3
#> # Groups:   fsmoke [2]
#>   fsmoke flow      n
#>   <fct>  <fct> <int>
#> 1 yes    yes      30
#> 2 yes    no       44
#> 3 no     yes      29
#> 4 no     no       86

# View the data in conventional 2 by 2 table format:
pivot_wider(dat.tab03, id_cols = c(fsmoke), 
   names_from = flow, values_from = n)
#> # A tibble: 2 × 3
#> # Groups:   fsmoke [2]
#>   fsmoke   yes    no
#>   <fct>  <int> <int>
#> 1 yes       30    44
#> 2 no        29    86

As before, compute the odds ratio for smoking and delivery of a low birth weight baby:

dat.epi03 <- epi.2by2(dat = dat.tab03, method = "cohort.count", elab = "Smoke", olab = "Low BW", digits = 2, conf.level = 0.95, units = 100, interpret = FALSE, outcome = "as.columns")
dat.epi03
#>           Low BW+    Low BW-      Total                 Inc risk *
#> Smoke+         30         44         74     40.54 (29.27 to 52.59)
#> Smoke-         29         86        115     25.22 (17.58 to 34.17)
#> Total          59        130        189     31.22 (24.69 to 38.35)
#> 
#> Point estimates and 95% CIs:
#> -------------------------------------------------------------------
#> Inc risk ratio                                 1.61 (1.06, 2.44)
#> Inc odds ratio                                 2.02 (1.08, 3.78)
#> Attrib risk in the exposed *                   15.32 (1.61, 29.04)
#> Attrib fraction in the exposed (%)            37.80 (5.38, 58.95)
#> Attrib risk in the population *                6.00 (-4.33, 16.33)
#> Attrib fraction in the population (%)         19.22 (10.89, 28.78)
#> -------------------------------------------------------------------
#> Uncorrected chi2 test that OR = 1: chi2(1) = 4.924 Pr>chi2 = 0.026
#> Fisher exact test that OR = 1: Pr>chi2 = 0.036
#>  Wald confidence limits
#>  CI: confidence interval
#>  * Outcomes per 100 population units

The incidence odds of low BW among those mothers that were smoke positive was 2.0 (95% CI 1.1 to 3.8) times the incidence odds of low BW among those mothers that were smoke negative.

Measures of association for a series of candidate risk factors

The code that follows provides an approach for calculating measures of association for a series of candidate risk factors, writing the results of each analysis to a data frame. Here we set up a loop to calculate the odds ratio for age, smoke and race:

dat.df04 <- birthwt; head(dat.df04)
#>    low age lwt race smoke ptl ht ui ftv  bwt
#> 85   0  19 182    2     0   0  0  1   0 2523
#> 86   0  33 155    3     0   0  0  0   3 2551
#> 87   0  20 105    1     1   0  0  0   1 2557
#> 88   0  21 108    1     1   0  0  1   2 2594
#> 89   0  18 107    1     1   0  0  1   0 2600
#> 91   0  21 124    3     0   0  0  0   0 2622

dat.df04$flow <- factor(dat.df04$low, levels = c(1,0))

dat.df04$fage <- ifelse(dat.df04$age > 23, 0,1)
dat.df04$fage <- factor(dat.df04$fage, levels = c(1,0))
dat.df04$fsmoke <- factor(dat.df04$smoke, levels = c(1,0))

# Race is coded 1 = white, 2 = black and 3 = other. Set white as the reference level:
dat.df04$frace <- ifelse(dat.df04$race == 1, 0, 1)
dat.df04$frace <- factor(dat.df04$frace, levels = c(1,0))

# Empty vectors to collect results:
rfactor <- ref <- or.p <- or.l <- or.u <- c() 

# The candidate risk factors are in columns 12 to 14 of data frame dat.df04:
for(i in 12:14){
  tdat.tab04 <- table(dat.df04[,i], dat.df04$flow)
  tdat.epi04 <- epi.2by2(dat = tdat.tab04, method = "cohort.count", 
   digits = 2, conf.level = 0.95, units = 100, interpret = FALSE, outcome = "as.columns")
  
  trfactor <- as.character(names(dat.df04)[i]) 
  rfactor <- c(rfactor, trfactor) 
  
  tref <- as.character(paste("Reference: ", trfactor, " - ", levels(dat.df04[,i])[2], sep = ""))
  ref <- c(ref, tref)
  
  tor.p <- as.numeric(tdat.epi04$massoc.detail$OR.strata.wald[1])
  or.p <- c(or.p, tor.p)
  
  tor.l <- as.numeric(tdat.epi04$massoc.detail$OR.strata.wald[2])
  or.l <- c(or.l, tor.l)
  
  tor.u <- as.numeric(tdat.epi04$massoc.detail$OR.strata.wald[3])
  or.u <- c(or.u, tor.u)
}

gdat.df04 <- data.frame(ybrk = 1:3, ylab = rfactor, ref, or.p, or.l, or.u)
gdat.df04
#>   ybrk   ylab                   ref     or.p      or.l     or.u
#> 1    1   fage   Reference: fage - 0 1.174769 0.6294241 2.192609
#> 2    2 fsmoke Reference: fsmoke - 0 2.021944 1.0806596 3.783112
#> 3    3  frace  Reference: frace - 0 2.004577 1.0703040 3.754380

Plot the point estimate of the odds ratio for each candidate risk factor and its 95% confidence interval. Annotate the plot with the reference category for each risk factor. Note that a log 2 scale is used for the horizontal axis:

library(ggplot2); library(scales)

xbrk <- seq(from = -2, to = 2, by = 1)
xlab <- 2^xbrk

ggplot(data = gdat.df04, aes(x = log2(or.p), y = ybrk)) +
  theme_bw() +
  geom_point() + 
  geom_errorbarh(aes(xmin = log2(or.l), xmax = log2(or.u), height = 0.2)) + 
  scale_x_continuous(breaks = xbrk, labels = xlab, limits = range(xbrk), 
   name = "Odds ratio") + 
  scale_y_continuous(breaks = gdat.df04$ybrk, labels = gdat.df04$ylab, name = "Risk factor") + 
  geom_vline(xintercept = log2(1), linetype = "dashed") + 
  annotate("text", x = log2(0.25), y = gdat.df04$ybrk, label = gdat.df04$ref, hjust = 0, size = 3) +
  coord_fixed(ratio = 0.75 / 1) + 
  theme(axis.title.y = element_text(vjust = 0))
#> Warning: `geom_errobarh()` was deprecated in ggplot2 4.0.0.
#> ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning in geom_errorbar(..., orientation = orientation): Ignoring unknown
#> aesthetics: height

Figure 6: Risk factors for low birth weight babies. Error bar plot showing the point estimate of the odds ratio and its 95% confidence interval for maternal age, smoking and race.

Confounding

We’re concerned that the mother’s race may confound the association between low birth weight and delivery of a low birth weight baby so we’ll stratify the data by race and compute the Mantel-Haenszel adjusted odds ratio. As before, our tables can be prepared using either base R or tidyverse.

Stratified two by two table preparation using the table function in base R

dat.df05 <- birthwt; head(dat.df05)
#>    low age lwt race smoke ptl ht ui ftv  bwt
#> 85   0  19 182    2     0   0  0  1   0 2523
#> 86   0  33 155    3     0   0  0  0   3 2551
#> 87   0  20 105    1     1   0  0  0   1 2557
#> 88   0  21 108    1     1   0  0  1   2 2594
#> 89   0  18 107    1     1   0  0  1   0 2600
#> 91   0  21 124    3     0   0  0  0   0 2622

dat.df05$flow <- factor(dat.df05$low, levels = c(1,0))

dat.df05$fsmoke <- factor(dat.df05$smoke, levels = c(1,0))
dat.df05$frace <- factor(dat.df05$race, levels = c(1,2,3))

dat.tab05 <- table(dat.df05$fsmoke, dat.df05$flow, dat.df05$frace, 
   dnn = c("Smoke", "Low BW", "Race")); dat.tab05
#> , , Race = 1
#> 
#>      Low BW
#> Smoke  1  0
#>     1 19 33
#>     0  4 40
#> 
#> , , Race = 2
#> 
#>      Low BW
#> Smoke  1  0
#>     1  6  4
#>     0  5 11
#> 
#> , , Race = 3
#> 
#>      Low BW
#> Smoke  1  0
#>     1  5  7
#>     0 20 35

Compute the Mantel-Haenszel adjusted odds ratio for smoking and delivery of a low birth weight baby, adjusting for the effect of race. Function epi.2by2 automatically returns the Mantel-Haenszel odds ratio and risk ratio when presented with stratified contingency tables.

dat.epi05 <- epi.2by2(dat = dat.tab05, method = "cohort.count", elab = "smoke",
   olab = "Low BW", digits = 2, conf.level = 0.95, units = 100, 
   interpret = FALSE, outcome = "as.columns")
dat.epi05
#>           Low BW+    Low BW-      Total                 Inc risk *
#> smoke+         30         44         74     40.54 (29.27 to 52.59)
#> smoke-         29         86        115     25.22 (17.58 to 34.17)
#> Total          59        130        189     31.22 (24.69 to 38.35)
#> 
#> 
#> Point estimates and 95% CIs:
#> -------------------------------------------------------------------
#> Inc risk ratio (crude)                         1.61 (1.06, 2.44)
#> Inc risk ratio (M-H)                           2.15 (1.29, 3.58)
#> Inc risk ratio (crude:M-H)                     0.75
#> Inc odds ratio (crude)                         2.02 (1.08, 3.78)
#> Inc odds ratio (M-H)                           3.09 (1.49, 6.39)
#> Inc odds ratio (crude:M-H)                     0.66
#> Attrib risk in the exposed (crude) *           15.32 (1.61, 29.04)
#> Attrib risk in the exposed (M-H) *             22.17 (1.41, 42.94)
#> Attrib risk (crude:M-H)                        0.69
#> -------------------------------------------------------------------
#>  M-H test of homogeneity of IRRs: chi2(2) = 3.862 Pr>chi2 = 0.145
#>  M-H test of homogeneity of ORs: chi2(2) = 2.800 Pr>chi2 = 0.247
#>  Test that M-H adjusted OR = 1:  chi2(1) = 9.413 Pr>chi2 = 0.002
#>  Wald confidence limits
#>  M-H: Mantel-Haenszel; CI: confidence interval
#>  * Outcomes per 100 population units

The Mantel-Haenszel test of homogeneity of the strata odds ratios is not significant ($\chi^{2}$ test statistic 2.800; df 2; p-value = 0.25). We accept the null hypothesis and conclude that the odds ratios for each strata of race are the same. Because the stratum specific odds ratios are not statistically significantly different the Mantel-Haenszel adjusted odds ratio provides an appropriate summary of the association between smoking and low birth weight.

After accounting for the confounding effect of race, the odds of having a low birth weight baby for smokers was 3.1 (95% CI 1.5 to 6.4) times that of non-smokers.

Stratified two by two table preparation using tidyverse

dat.df06 <- birthwt

dat.tab06 <- dat.df06 %>%
  mutate(flow = factor(low, levels = c(1,0), labels = c("yes","no"))) %>%
  mutate(fsmoke = factor(smoke, levels = c(1,0), labels = c("yes","no"))) %>%
  mutate(frace = factor(race)) %>%
  group_by(frace, fsmoke, flow) %>%
  summarise(n = n()) 
dat.tab06
#> # A tibble: 12 × 4
#> # Groups:   frace, fsmoke [6]
#>   frace fsmoke flow      n
#>   <fct> <fct>  <fct> <int>
#> 1 1     yes    yes      19
#> 2 1     yes    no       33
#> 3 1     no     yes       4
#> 4 1     no     no       40
#> # ℹ 8 more rows

# View the data in conventional 2 by 2 table format:
pivot_wider(dat.tab06, id_cols = c(frace, fsmoke), 
   names_from = flow, values_from = n)
#> # A tibble: 6 × 4
#> # Groups:   frace, fsmoke [6]
#>   frace fsmoke   yes    no
#>   <fct> <fct>  <int> <int>
#> 1 1     yes       19    33
#> 2 1     no         4    40
#> 3 2     yes        6     4
#> 4 2     no         5    11
#> # ℹ 2 more rows

Compute the Mantel-Haenszel adjusted odds ratio for smoking and delivery of a low birth weight baby, adjusting for the effect of race:

dat.epi06 <- epi.2by2(dat = dat.tab06, method = "cohort.count", 
   elab = "Smoke", olab = "Low BW", digits = 2, conf.level = 0.95, 
   units = 100, interpret = FALSE, outcome = "as.columns")

dat.epi06
#>           Low BW+    Low BW-      Total                 Inc risk *
#> Smoke+         30         44         74     40.54 (29.27 to 52.59)
#> Smoke-         29         86        115     25.22 (17.58 to 34.17)
#> Total          59        130        189     31.22 (24.69 to 38.35)
#> 
#> 
#> Point estimates and 95% CIs:
#> -------------------------------------------------------------------
#> Inc risk ratio (crude)                         1.61 (1.06, 2.44)
#> Inc risk ratio (M-H)                           2.15 (1.29, 3.58)
#> Inc risk ratio (crude:M-H)                     0.75
#> Inc odds ratio (crude)                         2.02 (1.08, 3.78)
#> Inc odds ratio (M-H)                           3.09 (1.49, 6.39)
#> Inc odds ratio (crude:M-H)                     0.66
#> Attrib risk in the exposed (crude) *           15.32 (1.61, 29.04)
#> Attrib risk in the exposed (M-H) *             22.17 (1.41, 42.94)
#> Attrib risk (crude:M-H)                        0.69
#> -------------------------------------------------------------------
#>  M-H test of homogeneity of IRRs: chi2(2) = 3.862 Pr>chi2 = 0.145
#>  M-H test of homogeneity of ORs: chi2(2) = 2.800 Pr>chi2 = 0.247
#>  Test that M-H adjusted OR = 1:  chi2(1) = 9.413 Pr>chi2 = 0.002
#>  Wald confidence limits
#>  M-H: Mantel-Haenszel; CI: confidence interval
#>  * Outcomes per 100 population units

Plot the individual strata odds ratios and the Mantel-Haenszel summary odds ratio as an error bar plot to better understand how the Mantel-Haenszel adjusted odds ratio relates to the individual strata odds ratios:


xbrk <- seq(from = -5, to = 5, by = 1)
xlab <- 2^xbrk

nstrata <- dat.epi06$n.strata
ybrk <- c(1:nstrata, max(nstrata) + 1)
ylab <- c("M-H", paste("Strata ", 1:nstrata, sep = ""))

or.p <- c(dat.epi06$massoc.detail$OR.mh$est, 
   dat.epi06$massoc.detail$OR.strata.cfield$est)
or.l <- c(dat.epi06$massoc.detail$OR.mh$lower, 
   dat.epi06$massoc.detail$OR.strata.cfield$lower)
or.u <- c(dat.epi06$massoc.detail$OR.mh$upper, 
   dat.epi06$massoc.detail$OR.strata.cfield$upper)
gdat.df06 <- data.frame(ybrk, ylab, or.p, or.l, or.u)

ggplot(data = gdat.df06, aes(x = log2(or.p), y = ybrk)) +
  theme_bw() +
  geom_point() + 
  geom_errorbarh(aes(xmin = log2(or.u), xmax = log2(or.l), height = 0.2)) + 
  scale_x_continuous(breaks = xbrk, labels = xlab, limits = range(xbrk), 
   name = "Odds ratio") + 
  scale_y_continuous(breaks = gdat.df06$ybrk, labels = gdat.df06$ylab, name = "Risk factor") + 
  geom_vline(xintercept = log2(1), linetype = "dashed") + 
  annotate("text", x = log2(0.03125), y = gdat.df06$yat, label = gdat.df06$ref, hjust = 0, size = 3) +
  coord_fixed(ratio = 0.75 / 1) + 
  theme(axis.title.y = element_text(vjust = 0))
#> Warning in geom_errorbar(..., orientation = orientation): Ignoring unknown
#> aesthetics: height
#> Warning in annotate("text", x = log2(0.03125), y = gdat.df06$yat, label =
#> gdat.df06$ref, : Ignoring empty aesthetic: `label`.

Figure 7: Risk factors for low birth weight babies. Error bar plot showing the odds of having a low birth weight baby for smokers of maternal race categories 1, 2 and 3 and the Mantel-Haenszel odds of having a low birth weight baby for smokers, adjusted for maternal race.

References

Altman, DG. 1998. “Confidence Intervals for the Number Needed to Treat.” British Medical Journal 317: 1309–12.

Grimes, DA, and KF Schulz. 2008. “Making Sense of Odds and Odds Ratios.” Obstetrics and Gynecology 111: 423–26.

Hosmer, DW, and S Lemeshow. 2000. Applied Logistic Regression. London: Jon Wiley; Sons Inc.

Kuritz, SJ, JR Landis, and GG Koch. 1988. “A general overview of Mantel-Haenszel methods: Applications and recent developments.” Annual Reviews in Public Health 9: 123–60.

Porta, M, S Greenland, and JM Last. 2014. A Dictionary of Epidemiology. London: Oxford University Press.

Prasad, K, R Jaeschke, P Wyer, S Keitz, and G Guyatt. 2007. “Tips for teachers of evidence-based medicine: Understanding odds ratios and their relationship to risk ratios.” Journal of General Internal Medicine 23 (5): 635–40.

Siegerink, B, and JL Rohmann. 2018. “Impact of your results: Beyond the relative risk.” Research and Practice in Thrombosis and Haemostasis 2: 653–57.

Willeberg, P. 1977. “Animal disease information processing: Epidemiologic analyses of the feline urologic syndrome.” Acta Veterinaria Scandinavica 64: 1–48.

Measures of Assocation using epiR

Mark Stevenson

2025-09-29

Measures of Assocation using

Measures of association strength

The incidence risk ratio

The odds ratio — cohort studies

The odds ratio — case-control studies

Measures of effect in the exposed

The attributable risk in the exposed

The attributable fraction in the exposed

Measures of effect in the population

The attributable risk in the population

The attributable fraction in the population

Theory to practice: Calculating measures of association using R

Direct entry of 2 by 2 table contingency table cell frequencies

Data frame with one row per observation

Two by two table preparation using the `table` function in base R

Two by two table preparation using `tidyverse`

Measures of association for a series of candidate risk factors

Confounding

Stratified two by two table preparation using the table function in base R

Stratified two by two table preparation using tidyverse

References

Measures of Assocation using epiR

Mark Stevenson

2025-09-29

Measures of Assocation using

Measures of association strength

The incidence risk ratio

The odds ratio — cohort studies

The odds ratio — case-control studies

Measures of effect in the exposed

The attributable risk in the exposed

The attributable fraction in the exposed

Measures of effect in the population

The attributable risk in the population

The attributable fraction in the population

Theory to practice: Calculating measures of association using R

Direct entry of 2 by 2 table contingency table cell frequencies

Data frame with one row per observation

Two by two table preparation using the table function in base R

Two by two table preparation using tidyverse

Measures of association for a series of candidate risk factors

Confounding

Stratified two by two table preparation using the table function in base R

Stratified two by two table preparation using tidyverse

References

Two by two table preparation using the `table` function in base R

Two by two table preparation using `tidyverse`