Exercise 6. Diet data: tabulating incidence rates and modelling with Poisson regression
You may have to install the required packages the first time you use
them. You can install a package by
install.packages("package_of_interest") for each package
you require.
library(biostat3) # diet dataset
library(dplyr)    # for data manipulation
library(knitr)    # kable()
library(broom)    # tidy()Load the diet data using time-on-study as the timescale. We look at
the first six rows of the data using head and look at a
summary for the variables using summary:
| id | chd | y | hieng | energy | job | month | height | weight | doe | dox | dob | yoe | yox | yob | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 127 | 0 | 16.791239 | low | 2023.25 | conductor | 2 | 173.9900 | 61.46280 | 1960-02-16 | 1976-12-01 | 1910-09-27 | 1960.126 | 1976.917 | 1910.737 | 
| 200 | 0 | 19.958933 | low | 2448.68 | bank | 12 | 177.8000 | 73.48320 | 1956-12-16 | 1976-12-01 | 1909-06-18 | 1956.958 | 1976.917 | 1909.460 | 
| 198 | 0 | 19.958933 | low | 2281.38 | bank | 12 | NA | NA | 1956-12-16 | 1976-12-01 | 1910-06-30 | 1956.958 | 1976.917 | 1910.493 | 
| 222 | 0 | 15.394935 | low | 2467.95 | bank | 2 | 158.7500 | 58.24224 | 1957-02-16 | 1972-07-10 | 1902-07-11 | 1957.126 | 1972.523 | 1902.523 | 
| 305 | 1 | 1.494867 | low | 2362.93 | bank | 1 | NA | NA | 1960-01-16 | 1961-07-15 | 1913-06-30 | 1960.041 | 1961.534 | 1913.493 | 
| 173 | 0 | 15.958932 | low | 2380.11 | conductor | 12 | 164.4904 | 79.01712 | 1960-12-16 | 1976-12-01 | 1915-06-28 | 1960.958 | 1976.917 | 1915.487 | 
| id | chd | y | hieng | energy | job | month | height | weight | doe | dox | dob | yoe | yox | yob | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1 | Min. :0.0000 | Min. : 0.2875 | low :155 | Min. :1748 | driver :102 | Min. : 1.000 | Min. :152.4 | Min. : 46.72 | Min. :1956-11-16 | Min. :1958-08-29 | Min. :1892-01-10 | Min. :1957 | Min. :1959 | Min. :1892 | |
| 1st Qu.: 85 | 1st Qu.:0.0000 | 1st Qu.:10.7762 | high:182 | 1st Qu.:2537 | conductor: 84 | 1st Qu.: 3.000 | 1st Qu.:168.9 | 1st Qu.: 64.64 | 1st Qu.:1959-01-16 | 1st Qu.:1972-09-29 | 1st Qu.:1906-01-18 | 1st Qu.:1959 | 1st Qu.:1973 | 1st Qu.:1906 | |
| Median :169 | Median :0.0000 | Median :15.4606 | NA | Median :2803 | bank :151 | Median : 6.000 | Median :173.0 | Median : 72.80 | Median :1960-02-16 | Median :1976-12-01 | Median :1911-02-25 | Median :1960 | Median :1977 | Median :1911 | |
| Mean :169 | Mean :0.1365 | Mean :13.6607 | NA | Mean :2829 | NA | Mean : 6.231 | Mean :173.4 | Mean : 72.54 | Mean :1960-06-22 | Mean :1974-02-19 | Mean :1911-01-04 | Mean :1960 | Mean :1974 | Mean :1911 | |
| 3rd Qu.:253 | 3rd Qu.:0.0000 | 3rd Qu.:17.0431 | NA | 3rd Qu.:3110 | NA | 3rd Qu.:10.000 | 3rd Qu.:177.8 | 3rd Qu.: 79.83 | 3rd Qu.:1961-06-16 | 3rd Qu.:1976-12-01 | 3rd Qu.:1915-01-30 | 3rd Qu.:1961 | 3rd Qu.:1977 | 3rd Qu.:1915 | |
| Max. :337 | Max. :1.0000 | Max. :20.0411 | NA | Max. :4396 | NA | Max. :12.000 | Max. :190.5 | Max. :106.14 | Max. :1966-09-16 | Max. :1976-12-01 | Max. :1930-09-19 | Max. :1967 | Max. :1977 | Max. :1931 | |
| NA | NA | NA | NA | NA | NA | NA | NA’s :5 | NA’s :4 | NA | NA | NA | NA | NA | NA | 
(a)
diet <- biostat3::diet
diet.ir6a <- survRate(Surv(y/1000,chd) ~ hieng, data=diet)
kable(diet.ir6a, "html")| hieng | tstop | event | rate | lower | upper | |
|---|---|---|---|---|---|---|
| hieng=low | low | 2.059430 | 28 | 13.595992 | 9.034438 | 19.64999 | 
| hieng=high | high | 2.544238 | 18 | 7.074809 | 4.192980 | 11.18125 | 
## or
diet |>
    group_by(hieng) |>
    summarise(Event = sum(chd), Time = sum(y/1000), Rate = Event/Time,      # group sums
              lower.ci = poisson.test(Event,Time)$conf.int[1],
              upper.ci = poisson.test(Event,Time)$conf.int[2]) |> kable("html")| hieng | Event | Time | Rate | lower.ci | upper.ci | 
|---|---|---|---|---|---|
| low | 28 | 2.059430 | 13.595992 | 9.034438 | 19.64999 | 
| high | 18 | 2.544238 | 7.074809 | 4.192980 | 11.18125 | 
## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 28, expected count1 = 20.578, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  1.026173 3.688904
## sample estimates:
## rate ratio 
##   1.921747## 
##  Comparison of Poisson rates
## 
## data:  rev(event) time base: rev(tstop)
## count1 = 18, expected count1 = 25.422, p-value = 0.03681
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.2710832 0.9744943
## sample estimates:
## rate ratio 
##  0.5203599We see that individuals with a high energy intake have a lower CHD incidence rate. The estimated crude incidence rate ratio is 0.52 (95% CI: 0.27, 0.97).
(b)
The code is:
## 
## Call:
## glm(formula = chd ~ hieng + offset(log(y/1000)), family = poisson, 
##     data = diet)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.6098     0.1890  13.811   <2e-16 ***
## hienghigh    -0.6532     0.3021  -2.162   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 258.00  on 335  degrees of freedom
## AIC: 354
## 
## Number of Fisher Scoring iterations: 6##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.3877130 19.6907371
## hienghigh    0.5203599 0.2878432  0.9407012## Waiting for profiling to be done...##              exp(beta)     2.5 %     97.5 %
## (Intercept) 13.5959916 9.1614552 19.2715805
## hienghigh    0.5203599 0.2829171  0.9328392## # A tibble: 2 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   13.6       0.189     13.8  2.20e-43    9.16     19.3  
## 2 hienghigh      0.520     0.302     -2.16 3.06e- 2    0.283     0.933The point estimate for the IRR calculated by the Poisson regression is the same as the IRR calculated in 6(a). The regression model can be defined by:
\[\begin{align*} E(\text{chd}) &= \frac{y}{1000}\exp\left(\beta_0 + \beta_1I(\text{hieng}="high") \right) \\ &= \exp\left(\beta_0 + \beta_1I(\text{hieng}="high") + \log(y/1000) \right) \end{align*}\]
where \(E(\text{chd})\) is the expected count for CHD, \(\beta_0\) is the intercept parameter for the log rate and \(\beta_1\) is the parameter for the log rate ratio for high energy diets. We have also used \(I(\text{condition})\) as an indicator function, which will take a value of 1 when the condition is true and 0 otherwise.
A theoretical observation: if we consider the data as being
cross-classified solely by hieng then the Poisson
regression model with one parameter is a saturated model (that
is, the number of observations is equal to the number of parameters) so
the IRR estimated from the model will be identical to the ‘observed’
IRR. That is, the model is a perfect fit.
(c)
hist(diet$energy, breaks=25, probability=TRUE, xlab="Energy (units)")
curve(dnorm(x, mean=mean(diet$energy), sd=sd(diet$energy)), col = "red", add=TRUE)
lines(density(diet$energy), col = "blue")
legend("topright", legend=c("Normal density","Smoothed density"),
       lty=1, col=c("red", "blue"), bty="n")##       1%       5%      10%      25%      50%      75%      90%      95% 
## 1887.268 2177.276 2314.114 2536.690 2802.980 3109.660 3365.644 3588.178 
##      99% 
## 4046.820The histogram and density curve gives us an idea of the distribution of energy intake. The normal curve suggests that a normal approximation may be acceptable. We can also tabulate moments and percentiles of the distribution.
(d)
diet <- transform(diet, eng3=cut(energy, breaks=c(1500,2500,3000,4500),
                                 labels=c("low","medium","high"), 
                                 right = FALSE))
cbind(Freq=table(diet$eng3),
      Prop=proportions(table(diet$eng3))) |> kable("html")| Freq | Prop | |
|---|---|---|
| low | 75 | 0.2225519 | 
| medium | 150 | 0.4451039 | 
| high | 112 | 0.3323442 | 
This could also be done using dplyr::mutate.
(e)
##               eng3     tstop event      rate    lower     upper
## eng3=low       low 0.9466338    16 16.901995 9.660951 27.447781
## eng3=medium medium 2.0172621    22 10.905871 6.834651 16.511619
## eng3=high     high 1.6397728     8  4.878725 2.106287  9.613033## [1] 0.6452416## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 22, expected count1 = 25.863, p-value = 0.2221
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.3237007 1.3143509
## sample estimates:
## rate ratio 
##  0.6452416## [1] 0.2886479## 
##  Comparison of Poisson rates
## 
## data:  event time base: tstop
## count1 = 8, expected count1 = 15.216, p-value = 0.004579
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
##  0.1069490 0.7148284
## sample estimates:
## rate ratio 
##  0.2886479We see that the CHD incidence rate decreases as the level of total energy intake increases. These calculations are usually better done using a regression model as they are less fiddly and can be adjusted for other covariates.
(f)
diet <- transform(diet, 
                  X1 = as.numeric(eng3 == "low"),
                  X2 = as.numeric(eng3 == "medium"),
                  X3 = as.numeric(eng3 == "high"))
## or (names eng3low, eng3medium and eng3high)
diet = cbind(diet, model.matrix(~eng3+0, diet))
colSums(diet[c("X1","X2","X3","eng3low","eng3medium","eng3high")])    X1         X2         X3    eng3low eng3medium   eng3high 
    75        150        112         75        150        112 (g)
| energy | eng3 | X1 | X2 | X3 | 
|---|---|---|---|---|
| 2023.25 | low | 1 | 0 | 0 | 
| 2448.68 | low | 1 | 0 | 0 | 
| 2281.38 | low | 1 | 0 | 0 | 
| 2467.95 | low | 1 | 0 | 0 | 
| 2362.93 | low | 1 | 0 | 0 | 
| 2380.11 | low | 1 | 0 | 0 | 
| energy | eng3 | X1 | X2 | X3 | |
|---|---|---|---|---|---|
| 76 | 2664.64 | medium | 0 | 1 | 0 | 
| 77 | 2533.33 | medium | 0 | 1 | 0 | 
| 78 | 2854.08 | medium | 0 | 1 | 0 | 
| 79 | 2673.77 | medium | 0 | 1 | 0 | 
| 80 | 2766.88 | medium | 0 | 1 | 0 | 
| 81 | 2586.69 | medium | 0 | 1 | 0 | 
| energy | eng3 | X1 | X2 | X3 | |
|---|---|---|---|---|---|
| 226 | 3067.36 | high | 0 | 0 | 1 | 
| 227 | 3298.95 | high | 0 | 0 | 1 | 
| 228 | 3147.60 | high | 0 | 0 | 1 | 
| 229 | 3180.47 | high | 0 | 0 | 1 | 
| 230 | 3045.81 | high | 0 | 0 | 1 | 
| 231 | 3060.03 | high | 0 | 0 | 1 | 
(h)
## 
## Call:
## glm(formula = chd ~ X2 + X3 + offset(log(y/1000)), family = poisson, 
##     data = diet)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.8274     0.2500  11.312  < 2e-16 ***
## X2           -0.4381     0.3285  -1.334  0.18233    
## X3           -1.2425     0.4330  -2.870  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6## # A tibble: 3 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   16.9       0.250     11.3  1.15e-29    9.91     26.6  
## 2 X2             0.645     0.329     -1.33 1.82e- 1    0.341     1.25 
## 3 X3             0.289     0.433     -2.87 4.11e- 3    0.117     0.656Level 1 of the categorized total energy is the reference category. The estimated rate ratio comparing level 2 to level 1 is 0.6452 and the estimated rate ratio comparing level 3 to level 1 is 0.2886.
The regression equation can be represented by:
\[\begin{align*} E(\text{chd}) &= \frac{y}{1000}\exp\left(\beta_0 + \beta_2 X_2 + \beta_3 X_3 \right) \\ &= \exp\left(\beta_0 + \beta_2 X_2 + \beta_3 X_3 + \log(y/1000) \right) \end{align*}\] where \(\beta_2\) and \(\beta_3\) are the log rate ratios for \(X_2=\text{X2}\) and \(X_3=\text{X3}\), respectively.
(i)
poisson6i <- glm(chd ~ X1 + X3 + offset(log(y/1000)), family=poisson, data=diet)
# or 
poisson6i <- glm(chd ~ I(eng3=="low") + I(eng3=="high") + offset(log(y/1000)),
                 family=poisson, data=diet)
summary(poisson6i)## 
## Call:
## glm(formula = chd ~ I(eng3 == "low") + I(eng3 == "high") + offset(log(y/1000)), 
##     family = poisson, data = diet)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.3893     0.2132  11.207   <2e-16 ***
## I(eng3 == "low")TRUE    0.4381     0.3285   1.334   0.1823    
## I(eng3 == "high")TRUE  -0.8044     0.4129  -1.948   0.0514 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6## # A tibble: 3 × 7
##   term                  estimate std.error statistic  p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 "(Intercept)"           10.9       0.213     11.2  3.77e-29    6.96     16.1  
## 2 "I(eng3 == \"low\")T…    1.55      0.329      1.33 1.82e- 1    0.800     2.94 
## 3 "I(eng3 == \"high\")…    0.447     0.413     -1.95 5.14e- 2    0.187     0.966Now use level 2 as the reference (by omitting X2 but including X1 and X3). The estimated rate ratio comparing level 1 to level 2 is 1.5498 and the estimated rate ratio comparing level 3 to level 2 is 0.4473.
(j)
## 
## Call:
## glm(formula = chd ~ eng3 + offset(log(y/1000)), family = poisson, 
##     data = diet)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.8274     0.2500  11.312  < 2e-16 ***
## eng3medium   -0.4381     0.3285  -1.334  0.18233    
## eng3high     -1.2425     0.4330  -2.870  0.00411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.82  on 336  degrees of freedom
## Residual deviance: 253.62  on 334  degrees of freedom
## AIC: 351.62
## 
## Number of Fisher Scoring iterations: 6## # A tibble: 3 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   16.9       0.250     11.3  1.15e-29    9.91     26.6  
## 2 eng3medium     0.645     0.329     -1.33 1.82e- 1    0.341     1.25 
## 3 eng3high       0.289     0.433     -2.87 4.11e- 3    0.117     0.656The estimates are identical (as we would hope) when we have R create indicator variables for us.
(k)
Somehow (there are many different alternatives) you’ll need to calculate the total number of events and the total person-time at risk and then calculate the incidence rate as events/person-time. As examples,
##          rate
## 1 0.009992031##      tstop event     rate    lower    upper
## 1 4.603669    46 9.992031 7.315422 13.32797The estimated incidence rate is 0.00999 events per person-year.