Chapter 4 Contrasts and Multiple Testing

4.1 Contrasts

The \(F\)-test is rather unspecific. It basically gives us a “Yes/No” answer to the question “is there any treatment effect at all?”. It does not tell us which specific treatment (or treatment combination) is significant. Quite often, we have a more specific question than the aforementioned global null hypothesis. For example, we might want to compare a set of new treatments with a control treatment, or we might want to do pairwise comparisons between many (or all) treatments.

Such questions can typically be formulated as so-called contrasts. Let us start with a toy example based on the PlantGrowth data set. If we only wanted to compare trt1 (\(\mu_2\)) with ctrl (\(\mu_1\)), we could set up the null hypothesis \[ H_0: \mu_1 - \mu_2 = 0 \] vs. the alternative \[ H_A: \mu_1 - \mu_2 \neq 0. \] We can encode this with a vector \(c \in \mathbb{R}^g\) and write the null hypothesis as \[ H_0: \sum_{i=1}^g c_i \mu_i = 0. \] In this example, the vector \(c\) is equal to \(c = (1, -1, 0)\) (with respect to the levels ctrl, trt1 and trt2). Hence, a contrast is an encoding of our own, specific research question. A more sophisticated example would be \(c = (1, -1/2, -1/2)\), which compares ctrl with the average value of trt1 and trt2.

Typically, we have the side-constraint \[ \sum_{i=1}^g c_i = 0 \] which ensures that the contrast is about differences between treatments and not about the overall level of our response.

We estimate a contrast’s true (but unknown) value \(\sum_{i=1}^g c_i \mu_i\) (a linear combination of model parameters!) with \[ \sum_{i=1}^g c_i \widehat{\mu}_i. \] In addition, we can derive its accuracy (standard error), construct confidence intervals and perform tests. We omit the theoretical details and continue with our example.

In R we use the function glht (general linear hypotheses) of the add-on package multcomp (Hothorn, Bretz, and Westfall 2020).
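The examples below assume that fit is the one-way ANOVA model of the PlantGrowth data from before; for completeness, it can be (re-)created as follows (this matches the Fit: line in the output below):

fit <- aov(weight ~ group, data = PlantGrowth)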

library(multcomp)
fit.gh <- glht(fit, linfct = mcp(group = c(1, -1/2, -1/2)))
summary(fit.gh)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0  -0.0615     0.2414  -0.255    0.801
## (Adjusted p values reported -- single-step method)

This means we estimate the difference between ctrl and the average value of trt1 and trt2 as \(-0.0615\), and we do not reject the null hypothesis because the p-value is large. The annotation 1 == 0 means that this line tests whether the first (and here only) contrast is zero or not. We get a confidence interval by using the function confint.

confint(fit.gh)
## 
##   Simultaneous Confidence Intervals
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## Quantile = 2.0518
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##        Estimate lwr     upr    
## 1 == 0 -0.0615  -0.5569  0.4339

The confidence interval for \(\mu_1- \frac{1}{2}(\mu_2 + \mu_3)\) is given by \([-0.5569, 0.4339]\).

4.1.1 Some Technical Details

Every contrast has an associated sum of squares \[ SS_c = \frac{\left(\sum_{i=1}^g c_i \overline{y}_{i\cdot}\right)^2}{\sum_{i=1}^g \frac{c_i^2}{n_i}} \] having one degree of freedom, hence \(MS_c = SS_c\). This looks unintuitive at first sight, but it is nothing other than the square of the \(t\)-statistic of the corresponding null hypothesis about the model parameter \(\sum_{i=1}^g c_i \mu_i\), up to the factor \(MS_E\) (i.e., \(SS_c = t^2 \cdot MS_E\)). Under \(H_0\) it holds that \[ \frac{MS_c}{MS_E} \sim F_{1,\, N-g}. \] Because \(F_{1, \, m} = t_m^2\) (the square of a \(t_m\)-distribution with \(m\) degrees of freedom), this is nothing other than the “squared version” of the \(t\)-test. You can think of \(MS_c\) as the “part” of \(MS_{\textrm{Trt}}\) in the “direction” of \(c\).
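As a small numerical check (a sketch, assuming fit is the aov fit of the PlantGrowth data from above), we can compute \(SS_c\) for the contrast \(c = (1, -1/2, -1/2)\) by hand and verify that \(MS_c / MS_E\) equals the squared \(t\)-value reported by glht:

## Manual computation of SS_c for c = (1, -1/2, -1/2)
c.vec <- c(1, -1/2, -1/2)
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
n     <- tapply(PlantGrowth$weight, PlantGrowth$group, length)
SS.c  <- sum(c.vec * means)^2 / sum(c.vec^2 / n)
MS.E  <- sum(residuals(fit)^2) / fit$df.residual ## error mean square
SS.c / MS.E  ## approx. 0.065 = (-0.255)^2, the squared t-value from glht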

Two contrasts \(c\) and \(c^*\) are called orthogonal if \[ \sum_{i=1}^g \frac{c_i c_i^*}{n_i} = 0. \] If two contrasts \(c\) and \(c^*\) are orthogonal, the corresponding estimates are (statistically) independent. This means that if we know something about one of the contrasts, this does not help us in making a statement about the other one.
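For example, with the equal group sizes of the PlantGrowth data (\(n_i = 10\)), the contrast \(c = (1, -1/2, -1/2)\) from above and \(c^* = (0, 1, -1)\) (“trt1 vs. trt2”) are orthogonal; a quick check, reusing n from the snippet above:

c1 <- c(1, -1/2, -1/2) ## ctrl vs. average of trt1 and trt2
c2 <- c(0, 1, -1)      ## trt1 vs. trt2
sum(c1 * c2 / n)       ## 0, hence orthogonal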

If we have \(g\) treatments, we can find \(g - 1\) different orthogonal contrasts (one dimension is already used by the global mean \((1, \ldots, 1)\)). A set of orthogonal contrasts partitions the treatment sum of squares, meaning that if \(c^{(1)}, \ldots, c^{(g-1)}\) are orthogonal contrasts, it holds that \[ SS_{c^{(1)}} + \cdots + SS_{c^{(g-1)}} = SS_{\textrm{Trt}}. \] Intuition: “We get all information about the treatment by asking the right \(g - 1\) questions”.
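We can verify this decomposition for the PlantGrowth data with the two orthogonal contrasts from above (a sketch, continuing the snippets above; the treatment sum of squares can be read off the ANOVA table):

SS <- function(c.vec) sum(c.vec * means)^2 / sum(c.vec^2 / n)
SS(c1) + SS(c2)                ## approx. 3.77
anova(fit)["group", "Sum Sq"]  ## SS_Trt, also approx. 3.77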

4.2 Multiple Testing

The problem with all statistical tests is that the (overall) error rate increases with an increasing number of tests. Assume that we perform \(m\) (independent) tests \(H_{0, j}\), \(j = 1, \ldots, m\), each using an individual significance level of \(\alpha\). If all \(H_{0, j}\) are true, the probability of making at least one false rejection is given by \[ 1 - (1 - \alpha)^m. \] Even for small \(\alpha\), this is close to 1 if \(m\) is large. This means: if we perform many tests, we expect to find some significant results, even if all null hypotheses are true. Hence, we need a way to control the overall error rate.
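To get a feeling for how fast this happens, we can evaluate \(1 - (1 - \alpha)^m\) for \(\alpha = 0.05\) and increasing \(m\):

m <- c(1, 5, 10, 50, 100)
round(1 - (1 - 0.05)^m, 3)
## [1] 0.050 0.226 0.401 0.923 0.994

Let us first list the potential outcomes of a total of \(m\) tests, whereof \(m_0\) null hypotheses are true.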

                 \(H_0\) true   \(H_0\) false   Total
Significant      \(V\)          \(S\)           \(R\)
Not significant  \(U\)          \(T\)           \(m - R\)
Total            \(m_0\)        \(m - m_0\)     \(m\)

For example, \(V\) is the number of wrongly rejected null hypotheses.

The family-wise error rate is defined as the probability of rejecting at least one of the true \(H_0\)’s: \[ \textrm{FWER} = P(V \ge 1). \] The family-wise error rate is very strict in the sense that we do not consider the actual number of wrong decisions; we are only interested in whether there is at least one. This means the situation where we make \(V = 1\) error is equally “bad” as the situation where we make \(V = 20\) errors.

We say that a procedure controls the family-wise error rate in the strong sense at level \(\alpha\) if \[ \textrm{FWER} \le \alpha \] for any configuration of true and non-true null hypotheses. A typical choice would be \(\alpha = 0.05\).

Another error rate is the false discovery rate (FDR), which is the expected fraction of false discoveries among all rejections, \[ \textrm{FDR} = E \left[ \frac{V}{R} \right] \] (with the convention \(V / R = 0\) if \(R = 0\)). Controlling the FDR at level 0.2 means that, in our list of “significant findings”, we expect only 20% that are not “true findings” (so-called false positives). If we can live with a certain amount of false positives, the relevant quantity to control is the false discovery rate.
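In R, FDR-adjusted p-values (Benjamini-Hochberg) are available through the function p.adjust; a minimal sketch with made-up raw p-values:

p.raw <- c(0.001, 0.02, 0.04, 0.30) ## hypothetical raw p-values
round(p.adjust(p.raw, method = "BH"), 3)
## [1] 0.004 0.040 0.053 0.300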

If a procedure controls the FWER at level \(\alpha\), the FDR is automatically controlled at level \(\alpha\) too. On the other hand, a procedure that controls the FDR at level \(\alpha\) might have a much larger FWER. Hence, the FWER is a much stricter (more conservative) criterion (meaning: it leads to fewer rejections).

We can also control the error rates for confidence intervals. We call a set of confidence intervals simultaneous confidence intervals at level \((1 - \alpha)\) if the probability that all intervals cover the corresponding true parameter value is \((1 - \alpha)\). This means: We can look at all confidence intervals at the same time and get the correct “big picture” with probability \((1 - \alpha)\).

In the following, we focus on the family-wise error rate (FWER) and simultaneous confidence intervals.

We typically start with “individual” p-values that we modify (or adjust) such that the appropriate overall error rate (like the FWER) is controlled. The interpretation of an individual p-value is as you learned it in your introductory course (“the probability to observe an event as extreme as …”). The modified p-values should be interpreted as the smallest overall error rate at which we can still reject the corresponding null hypothesis.

4.2.1 Bonferroni

The Bonferroni correction is a generic but very conservative approach. The idea is to use the more restrictive (individual) significance level \(\alpha^* = \alpha / m\). It controls the family-wise error rate in the strong sense. Equivalently, we can multiply the “original” p-values by \(m\) (truncating at 1) and keep using the original \(\alpha\). Especially for large \(m\), the Bonferroni correction is very conservative. The confidence intervals based on the adjusted significance level are simultaneous.

We have a look at the previous example where we have two contrasts, \(c_1 = (1, -1/2, -1/2)\) (“control vs. the average of the remaining treatments”) and \(c_2 = (1, -1, 0)\) (“control vs. treatment 1”).

library(multcomp)
## Create a matrix where each *row* is a contrast
K <- rbind(c(1, -1/2, -1/2), ## ctrl vs. average of trt1 and trt2
           c(1, -1, 0))      ## ctrl vs. trt1
fit.gh <- glht(fit, linfct = mcp(group = K))

## Individual p-values
summary(fit.gh, test = adjusted("none"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0  -0.0615     0.2414  -0.255    0.801
## 2 == 0   0.3710     0.2788   1.331    0.194
## (Adjusted p values reported -- none method)
## Bonferroni corrected p-values
summary(fit.gh, test = adjusted("bonferroni"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0  -0.0615     0.2414  -0.255    1.000
## 2 == 0   0.3710     0.2788   1.331    0.389
## (Adjusted p values reported -- bonferroni method)
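The same adjustment can be reproduced with p.adjust in base R; a sketch using the rounded raw p-values from the output above (the small deviation in the second value is due to rounding):

p.adjust(c(0.801, 0.194), method = "bonferroni")
## [1] 1.000 0.388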

4.2.2 Bonferroni-Holm

Bonferroni-Holm is less conservative and uniformly more powerful than Bonferroni. It works as follows:

  1. Sort the p-values from small to large: \(p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}\).
  2. For \(j = 1, 2, \ldots\): reject the corresponding null hypothesis if \(p_{(j)} \leq \frac{\alpha}{m - j + 1}\).
  3. Stop at the first non-significant p-value; this and all larger p-values are not rejected.

Note that only the smallest p-value has the traditional Bonferroni correction. With the multcomp package we can set the argument test of the function summary accordingly.

summary(fit.gh, test = adjusted("holm"))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## Linear Hypotheses:
##        Estimate Std. Error t value Pr(>|t|)
## 1 == 0  -0.0615     0.2414  -0.255    0.801
## 2 == 0   0.3710     0.2788   1.331    0.389
## (Adjusted p values reported -- holm method)

In addition, this is implemented in the function p.adjust in R.
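For example, a sketch with the rounded raw p-values from above:

p.adjust(c(0.801, 0.194), method = "holm")
## [1] 0.801 0.388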

4.2.3 Scheffé

The Scheffé procedure controls for the search over any possible contrast (!). This means we can try out as many contrasts as we like and still get honest p-values. The price for this very nice property is low power.

The Scheffé procedure works as follows: calculate the \(F\)-ratio as for an ordinary contrast and use the distribution of \((g - 1) \cdot F_{g-1, \, N - g}\) instead of \(F_{1, \, N - g}\) to calculate p-values or critical values.

We can do this manually in R with the multcomp package. We first set up the contrast as an “ordinary” contrast and then do the calculation by hand. As glht reports the value of the \(t\)-test, we have to square it and divide by \(g - 1\). As an example, we consider the contrast \(c = (1/2, -1, 1/2)\).

fit.scheffe <- glht(fit, linfct = mcp(group = c(1/2, -1, 1/2)))

## p-value according to Scheffe (g = 3, N - g = 27):
## compare t^2 / (g - 1) with the F-distribution with g - 1 = 2 and N - g = 27 df
pf((summary(fit.scheffe)$test$tstat)^2 / 2, 2, 27, lower.tail = FALSE)
##          1 
## 0.05323245
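To see the price in power, we can compare the Scheffé critical value for the \(t\)-statistic with the usual \(t\)-quantile (a sketch for \(g = 3\), \(N - g = 27\)):

sqrt((3 - 1) * qf(0.95, 2, 27)) ## Scheffe critical value, approx. 2.59
qt(0.975, 27)                   ## ordinary two-sided t-test, approx. 2.05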

4.2.4 Tukey Honest Significant Difference (HSD)

A special case of a multiple testing problem is the comparison between all possible pairs of treatments. There are a total of \(g \cdot (g - 1) / 2\) pairs that we can inspect. We could perform all pairwise \(t\)-tests with the function pairwise.t.test (it uses a pooled standard deviation estimate from all groups).

## Without correction (but pooled sd estimate)
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  PlantGrowth$weight and PlantGrowth$group 
## 
##      ctrl   trt1  
## trt1 0.1944 -     
## trt2 0.0877 0.0045
## 
## P value adjustment method: none

The output is a matrix of p-values of the corresponding comparisons (see row and column labels).

We could now use the Bonferroni-Holm correction method, i.e., p.adjust.method = "holm" to get p-values that are adjusted for multiple testing.

## With correction (and pooled sd estimate)
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "holm")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  PlantGrowth$weight and PlantGrowth$group 
## 
##      ctrl  trt1 
## trt1 0.194 -    
## trt2 0.175 0.013
## 
## P value adjustment method: holm

There is a better (more powerful) alternative, called Tukey Honest Significant Difference (HSD). Think of it as a procedure that is custom-tailored for this situation. It gives us both p-values and confidence intervals. In R, this is implemented in the function TukeyHSD or in the package multcomp.

## Tukey HSD with built-in function, including plots:
TukeyHSD(fit)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## $group
##             diff        lwr       upr     p adj
## trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
## trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
## trt2-trt1  0.865  0.1737839 1.5562161 0.0120064
plot(TukeyHSD(fit))

Only the difference between trt1 and trt2 is significant. Of course, we get the same results when using the package multcomp.

## Tukey HSD with package multcomp:
fit.tukey <- glht(fit, linfct = mcp(group = "Tukey"))
summary(fit.tukey)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)  
## trt1 - ctrl == 0  -0.3710     0.2788  -1.331   0.3909  
## trt2 - ctrl == 0   0.4940     0.2788   1.772   0.1980  
## trt2 - trt1 == 0   0.8650     0.2788   3.103   0.0119 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(fit.tukey)
## 
##   Simultaneous Confidence Intervals
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## Quantile = 2.3441
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                  Estimate lwr     upr    
## trt1 - ctrl == 0 -0.3710  -1.0245  0.2825
## trt2 - ctrl == 0  0.4940  -0.1595  1.1475
## trt2 - trt1 == 0  0.8650   0.2115  1.5185
plot(confint(fit.tukey))

4.2.5 Multiple Comparison with a Control (MCC)

In the same spirit, if we want to compare all treatment groups with a control group, we have a so-called multiple comparisons with a control (MCC) problem. The corresponding procedure is the Dunnett procedure, which is implemented in the add-on package multcomp. By default, the first level of the factor is taken as the control group.

fit.dunnett <- glht(fit, linfct = mcp(group = "Dunnett"))
summary(fit.dunnett)
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Dunnett Contrasts
## 
## 
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)
## trt1 - ctrl == 0  -0.3710     0.2788  -1.331    0.323
## trt2 - ctrl == 0   0.4940     0.2788   1.772    0.153
## (Adjusted p values reported -- single-step method)
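If a different group should serve as the control, we can re-order the factor levels before fitting; a minimal sketch using base R’s relevel (the names group2 and fit.relevel are hypothetical):

PlantGrowth$group2 <- relevel(PlantGrowth$group, ref = "trt2")
fit.relevel <- aov(weight ~ group2, data = PlantGrowth)
summary(glht(fit.relevel, linfct = mcp(group2 = "Dunnett")))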

We get smaller p-values than with the Tukey HSD procedure because we have to correct for fewer tests: there are only \(g - 1\) comparisons with the control treatment, compared to \(g \cdot (g - 1) / 2\) pairwise comparisons.

4.2.6 FAQ

Should I only do individual tests if the global \(F\)-test is significant?

No, although this is still suggested in many textbooks. The procedures mentioned above have a built-in correction for multiple testing and do not rely on a significant \(F\)-test. Conditioning on the \(F\)-test leads to a very conservative approach regarding the type I error rate. In addition, the coverage rate of, e.g., Tukey HSD confidence intervals can be very low.

Could it be the case that the \(F\)-test is significant but Tukey HSD yields only insignificant pairwise tests?

Yes. The \(F\)-test can combine evidence across groups, which Tukey HSD cannot (see the example in the appendix).

Could it be the case that Tukey HSD yields a significant difference but the global \(F\)-test is not significant?

Yes. Tukey HSD has larger power for some alternatives because it tries to answer a more precise question (see the example in the appendix).

4.2.7 Appendix

4.2.7.1 Tukey vs. \(F\)-test

We use the following two data sets, each consisting of three groups with two observations per group.

x <- factor(rep(c("A", "B", "C"), each = 2))

y1 <- c(0.50, 0.62, 
        0.46, 0.63,
        0.95, 0.86)

y2 <- c(0.23, 0.34, 
        0.45, 0.55, 
        0.55, 0.66)
stripchart(y1 ~ x, vertical = TRUE, pch = 1)

For the first data set, the \(F\)-test is significant, but TukeyHSD is not.

fit1 <- aov(y1 ~ x)
summary(fit1)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## x            2 0.1659 0.08295   9.683 0.0491 *
## Residuals    3 0.0257 0.00857                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = y1 ~ x)
## 
## $x
##       diff         lwr       upr     p adj
## B-A -0.015 -0.40177079 0.3717708 0.9856937
## C-A  0.345 -0.04177079 0.7317708 0.0668983
## C-B  0.360 -0.02677079 0.7467708 0.0600951
stripchart(y2 ~ x, vertical = TRUE, pch = 1)

For the second data set, the \(F\)-test is not significant, but TukeyHSD is.

fit2 <- aov(y2 ~ x)
summary(fit2)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## x            2 0.1064 0.05322   9.336 0.0515 .
## Residuals    3 0.0171 0.00570                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(fit2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = y2 ~ x)
## 
## $x
##      diff          lwr       upr     p adj
## B-A 0.215 -0.100489685 0.5304897 0.1268772
## C-A 0.320  0.004510315 0.6354897 0.0481916
## C-B 0.105 -0.210489685 0.4204897 0.4479192

Bibliography

Hothorn, Torsten, Frank Bretz, and Peter Westfall. 2020. multcomp: Simultaneous Inference in General Parametric Models. https://CRAN.R-project.org/package=multcomp.