# Chapter 4 Contrasts and Multiple Testing

## 4.1 Contrasts

The \(F\)-test is rather *unspecific*. It basically gives us a “Yes/No”
answer to the question “is there any treatment effect at all?” It
doesn’t tell us which specific treatment (or treatment combination) is
significant. Quite often we have a more specific question than the
aforementioned *global* null hypothesis. For example, we might want to
compare a set of new treatments with a control treatment, or we might
want to do pairwise comparisons between many (or all) treatments.

Such kinds of questions can typically be formulated as a so-called
**contrast**. Let us start with a toy example based on the `PlantGrowth`
data set. If we only wanted to compare `trt1` (\(\mu_2\)) with `ctrl`
(\(\mu_1\)) we could set up the null hypothesis
\[
H_0: \mu_1 - \mu_2 = 0
\]
vs. the alternative
\[
H_A: \mu_1 - \mu_2 \neq 0.
\]
We can encode this with a vector \(c \in \mathbb{R}^g\)
\[
H_0: \sum_{i=1}^g c_i \mu_i = 0.
\]
In this example the vector \(c\) is equal to \(c = (1, -1, 0)\) (with
respect to `ctrl`, `trt1` and `trt2`). Hence, a contrast is an encoding
of our own, specific research question. A more sophisticated example
would be \(c = (1, -1/2, -1/2)\), which compares `ctrl` versus the
average value of `trt1` and `trt2`.

Typically, we have the side-constraint \[ \sum_{i=1}^g c_i = 0 \] which ensures that the contrast is about differences between treatments and not about the overall level of our response.

We estimate a contrast's true (but unknown) value \(\sum_{i=1}^g c_i \mu_i\) (a linear combination of model *parameters*!) with
\[
\sum_{i=1}^g c_i \widehat{\mu}_i.
\]
In addition, we could derive its accuracy (standard error), construct
confidence intervals and do tests. We omit the theoretical details and
continue with our example.
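Before calling any specialized function, the point estimate and its standard error can also be computed by hand from the group means and the pooled error variance; a sketch for the contrast \(c = (1, -1/2, -1/2)\) on the built-in `PlantGrowth` data (balanced, \(n_i = 10\)):

```r
## Contrast c = (1, -1/2, -1/2) computed by hand (PlantGrowth, n_i = 10)
fit   <- aov(weight ~ group, data = PlantGrowth)
cv    <- c(1, -1/2, -1/2)
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
est   <- sum(cv * means)                   ## estimate of sum c_i mu_i
mse   <- deviance(fit) / df.residual(fit)  ## pooled error variance MS_E
se    <- sqrt(mse * sum(cv^2 / 10))        ## SE = sqrt(MS_E * sum(c_i^2 / n_i))
round(c(estimate = est, std.error = se), 4)  ## -0.0615, 0.2414
```

These values match the `glht` output below, which automates this computation.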

In `R` we use the function `glht` (general linear hypotheses) from the
add-on package `multcomp` (Hothorn, Bretz, and Westfall 2020).

```
library(multcomp)
fit.gh <- glht(fit, linfct = mcp(group = c(1, -1/2, -1/2)))
summary(fit.gh)
```

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.0615 0.2414 -0.255 0.801
## (Adjusted p values reported -- single-step method)
```

This means we estimate the difference between `ctrl` and the average
value of `trt1` and `trt2` as \(-0.0615\), and we do not reject the
null hypothesis because the p-value is large (the annotation `1 == 0`
means that this line tests whether the first, and here only, contrast
is zero or not). We get a confidence interval by using the function
`confint`.

`confint(fit.gh)`

```
##
## Simultaneous Confidence Intervals
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## Quantile = 2.0518
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## 1 == 0 -0.0615 -0.5569 0.4339
```

The confidence interval for \(\mu_1- \frac{1}{2}(\mu_2 + \mu_3)\) is given by \([-0.5569, 0.4339]\).

### 4.1.1 Some Technical Details

Every contrast has an associated **sum of squares**
\[
SS_c = \frac{\left(\sum_{i=1}^g c_i \overline{y}_{i\cdot}\right)^2}{\sum_{i=1}^g \frac{c_i^2}{n_i}}
\]
having *one* degree of freedom, hence \(MS_c = SS_c\). This looks
unintuitive at first sight, but it is nothing other than the square of
the \(t\)-statistic of the corresponding null hypothesis for the special
model parameter \(\sum_{i=1}^g c_i \mu_i\) (without the \(MS_E\) factor).
Under \(H_0\) it holds that
\[
\frac{MS_c}{MS_E} \sim F_{1,\, N-g}.
\]
Because \(F_{1, \, m} = t_m^2\) (the square of a \(t_m\)-distribution with
\(m\) degrees of freedom), this is nothing other than the “squared
version” of the \(t\)-test. You can think of \(MS_c\) as the “part” of
\(MS_{\textrm{Trt}}\) in the “direction” of \(c\).

Two contrasts \(c\) and \(c^*\) are called **orthogonal**
if
\[
\sum_{i=1}^g \frac{c_i c_i^*}{n_i} = 0.
\]
If two contrasts \(c\) and \(c^*\) are orthogonal, the corresponding
estimates are (statistically) independent. This means that if we know
something about one of the contrasts, this does *not* help us in making
a statement about the other one.

If we have \(g\) treatments, we can find \(g - 1\) different orthogonal
contrasts (one dimension is already used by the global mean
\((1, \ldots, 1)\)). A set of *orthogonal* contrasts **partitions** the
treatment sum of squares, meaning that if \(c^{(1)}, \ldots, c^{(g-1)}\)
are orthogonal contrasts it holds that
\[
SS_{c^{(1)}} + \cdots + SS_{c^{(g-1)}} = SS_{\textrm{Trt}}.
\]
Intuition: “We get all information about the treatment by asking the
right \(g - 1\) questions.”
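
We can check this partitioning numerically for the `PlantGrowth` data (balanced, \(n_i = 10\)), using the orthogonal pair \(c^{(1)} = (1, -1/2, -1/2)\) and \(c^{(2)} = (0, 1, -1)\); a sketch:

```r
## Orthogonal contrasts partition SS_Trt (PlantGrowth, n_i = 10)
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
n  <- 10
ss <- function(cv) sum(cv * means)^2 / sum(cv^2 / n)  ## SS_c from the formula
c1 <- c(1, -1/2, -1/2)  ## ctrl vs. average of trt1 and trt2
c2 <- c(0, 1, -1)       ## trt1 vs. trt2
sum(c1 * c2 / n)        ## 0: the contrasts are orthogonal
ss(c1) + ss(c2)         ## 3.7663, equal to SS_Trt
anova(aov(weight ~ group, data = PlantGrowth))["group", "Sum Sq"]
```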

## 4.2 Multiple Testing

The problem with all statistical tests is the fact that the (overall)
error rate increases with increasing number of tests. Assume that we
perform \(m\) (independent) tests \(H_{0, j}\), \(j = 1, \ldots, m\), using an
individual significance level of \(\alpha\). If all \(H_{0, j}\) are true,
the probability to make *at least one* false rejection is given by
\[
1 - (1 - \alpha)^m.
\]
Even for \(\alpha\) small this is close to 1 if \(m\) is large. This means:
if we perform many tests, we expect to find some significant results,
even if *all* null hypotheses are true. Hence, we need a way to control
the overall error rate. Let us first list the potential outcomes of a
total of \(m\) tests, whereof \(m_0\) null hypotheses are true.

|                 | \(H_0\) true | \(H_0\) false | Total     |
|-----------------|--------------|---------------|-----------|
| Significant     | \(V\)        | \(S\)         | \(R\)     |
| Not significant | \(U\)        | \(T\)         | \(m - R\) |
| Total           | \(m_0\)      | \(m - m_0\)   | \(m\)     |

For example, \(V\) is the number of wrongly rejected null hypotheses.
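
To see how quickly the probability \(1 - (1 - \alpha)^m\) of at least one false rejection grows, we can tabulate it:

```r
## P(at least one false rejection) among m independent tests, all H_0 true
alpha <- 0.05
m <- c(1, 5, 10, 50, 100)
round(1 - (1 - alpha)^m, 3)  ## 0.050 0.226 0.401 0.923 0.994
```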

The **family-wise error rate** is defined
as the probability of rejecting *at least one* of the *true* \(H_0\)’s:
\[
\textrm{FWER} = P(V \ge 1).
\]
The family-wise error rate is very strict in the sense that we are not
considering the actual *number* of wrong decisions; we are just
interested in whether there is *at least one*. This means the situation
where we make \(V = 1\) error is equally “bad” as the situation where
we make \(V = 20\) errors.

We say that a procedure controls the family-wise error rate in the
*strong sense* at level \(\alpha\) if
\[
\textrm{FWER} \le \alpha
\]
for *any* configuration of true and non-true null hypotheses. A typical
choice would be \(\alpha = 0.05\).

Another error rate is the **false discovery rate (FDR)**, which is the
expected fraction of false discoveries,
\[
\textrm{FDR} = E \left[ \frac{V}{R} \right]
\]
(with the convention \(V / R = 0\) if \(R = 0\)).
Controlling the FDR at level 0.2 means that among our “significant
findings” we expect only 20% to not be “true findings” (so-called
**false positives**). If we can live with a certain amount of false
positives, the relevant quantity to control is the false discovery
rate.

If a procedure controls the FWER at level \(\alpha\), the FDR is
*automatically* controlled at level \(\alpha\) too. On the other hand, a
procedure that controls the FDR at level \(\alpha\) might have a much
larger error rate regarding the FWER. Hence, the FWER is a much
stricter (more conservative) criterion (meaning: it leads to fewer
rejections).
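
The difference in strictness is easy to see with `p.adjust` on some hypothetical p-values (illustrative numbers, not from our data):

```r
## Hypothetical p-values; Bonferroni (controls FWER) vs.
## Benjamini-Hochberg (controls FDR)
p <- c(0.001, 0.01, 0.02, 0.04, 0.2)
round(p.adjust(p, method = "bonferroni"), 3)  ## 0.005 0.050 0.100 0.200 1.000
round(p.adjust(p, method = "BH"), 3)          ## 0.005 0.025 0.033 0.050 0.200
```

At \(\alpha = 0.05\), Bonferroni rejects two hypotheses while Benjamini-Hochberg rejects four.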

We can also control the error rates for confidence intervals. We call a
set of confidence intervals **simultaneous confidence intervals** at
level \((1 - \alpha)\) if the probability that *all* intervals cover the
corresponding true parameter value is \((1 - \alpha)\). This means: We can
look at all confidence intervals at the same time and get the correct
“big picture” with probability \((1 - \alpha)\).

In the following, we focus on the family-wise error rate (FWER) and simultaneous confidence intervals.

We typically start with “individual” p-values that we modify (or adjust)
such that the appropriate *overall* error rate (like FWER) is being
controlled. Interpretation of an individual p-value is as you learned it
in your introductory course (“the probability to observe an event as
extreme as …”). The modified p-values should be interpreted as the
smallest overall error rate such that we can reject the corresponding
null hypothesis.

### 4.2.1 Bonferroni

The **Bonferroni** correction is a generic but very conservative
approach. The idea is to use a more restrictive (individual)
significance level of \(\alpha^* = \alpha / m\). It controls the
family-wise error rate in the strong sense. Equivalently, we can also
multiply the “original” p-values by \(m\) and keep using the original
\(\alpha\). Especially for large \(m\) the Bonferroni correction is very
conservative. The confidence intervals based on the adjusted
significance level are simultaneous.

We have a look at the previous example where we have two contrasts, \(c_1 = (1, -1/2, -1/2)\) (“control vs. the average of the remaining treatments”) and \(c_2 = (1, -1, 0)\) (“control vs. treatment 1”).

```
library(multcomp)
## Create a matrix where each *row* is a contrast
K <- rbind(c(1, -1/2, -1/2), ## ctrl vs. average of trt1 and trt2
           c(1, -1,   0))    ## ctrl vs. trt1
fit.gh <- glht(fit, linfct = mcp(group = K))
## Individual p-values
summary(fit.gh, test = adjusted("none"))
```

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.0615 0.2414 -0.255 0.801
## 2 == 0 0.3710 0.2788 1.331 0.194
## (Adjusted p values reported -- none method)
```

```
## Bonferroni corrected p-values
summary(fit.gh, test = adjusted("bonferroni"))
```

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.0615 0.2414 -0.255 1.000
## 2 == 0 0.3710 0.2788 1.331 0.389
## (Adjusted p values reported -- bonferroni method)
```

### 4.2.2 Bonferroni-Holm

**Bonferroni-Holm** is less conservative and uniformly more powerful than Bonferroni. It works
as follows:

- Sort the p-values from smallest to largest: \(p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}\).
- For \(j = 1, 2, \ldots\): reject the corresponding null hypothesis if \(p_{(j)} \leq \frac{\alpha}{m - j + 1}\).
- Stop when reaching the *first* non-significant p-value.

Note that only the smallest p-value has the traditional Bonferroni
correction. With the `multcomp` package we can set the argument `test`
of the function `summary` accordingly.

`summary(fit.gh, test = adjusted("holm"))`

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: User-defined Contrasts
##
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## 1 == 0 -0.0615 0.2414 -0.255 0.801
## 2 == 0 0.3710 0.2788 1.331 0.389
## (Adjusted p values reported -- holm method)
```

In addition, this is implemented in the function `p.adjust` in `R`.
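
As a sketch, we can apply `p.adjust` to the two unadjusted p-values from above and reproduce the Holm adjustment by hand:

```r
## Holm adjustment by hand vs. p.adjust (unadjusted p-values from above)
p <- c(0.801, 0.194)
sort(p) * (2:1)               ## multiply sorted p-values by m, m - 1, ...
p.adjust(p, method = "holm")  ## 0.801 0.388 (multcomp reports 0.389
                              ## because it uses the unrounded p-value)
```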

### 4.2.3 Scheffé

The **Scheffé** procedure controls for the search over *any* possible
contrast (!). This means we can try out as many contrasts as we like and
still get honest p-values. The price for this very nice property is low
power.

The Scheffé procedure works as follows: Calculate \(F\)-ratio as if ordinary contrast and use the distribution \((g - 1) \cdot F_{g-1, \, N - g}\) instead of \(F_{1, \, N - g}\) to calculate p-values or critical values.

We can do this manually in `R` with the `multcomp` package. We first
calculate the contrast as an “ordinary” contrast and then do a manual
calculation. As `glht` reports the value of the \(t\)-test, we have to
take its square. As an example we consider the contrast
\(c = (1/2, -1, 1/2)\).

```
fit.scheffe <- glht(fit, linfct = mcp(group = c(1/2, -1, 1/2)))
## p-value according to Scheffe (g = 3, N - g = 27)
pf((summary(fit.scheffe)$test$tstat)^2 / 2, 2, 27, lower.tail = FALSE)
```

```
## 1
## 0.05323245
```

### 4.2.4 Tukey Honest Significant Difference (HSD)

A special case of a multiple testing problem is the comparison between
all possible pairs of treatments. There are a total of \(g \cdot (g - 1) / 2\)
pairs that we can inspect. We could perform all pairwise \(t\)-tests with
the function `pairwise.t.test` (it uses a pooled standard deviation
estimate from all groups).

```
## Without correction (but pooled sd estimate)
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "none")
```

```
##
## Pairwise comparisons using t tests with pooled SD
##
## data: PlantGrowth$weight and PlantGrowth$group
##
## ctrl trt1
## trt1 0.1944 -
## trt2 0.0877 0.0045
##
## P value adjustment method: none
```

The output is a *matrix* of p-values of the corresponding comparisons
(see row and column labels).

We could now use the Bonferroni-Holm correction method, i.e.,
`p.adjust.method = "holm"`, to get p-values that are adjusted for
multiple testing.

```
## With correction (and pooled sd estimate)
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "holm")
```

```
##
## Pairwise comparisons using t tests with pooled SD
##
## data: PlantGrowth$weight and PlantGrowth$group
##
## ctrl trt1
## trt1 0.194 -
## trt2 0.175 0.013
##
## P value adjustment method: holm
```

There is a better (more powerful) alternative which is called **Tukey
Honest Significant Difference**. Think of a procedure that is custom
tailored for this situation. It gives us both p-values and confidence
intervals. In `R` this is implemented in the function `TukeyHSD` or in
the package `multcomp`.

```
## Tukey HSD with built-in function, including plots:
TukeyHSD(fit)
```

```
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## $group
## diff lwr upr p adj
## trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
## trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
## trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
```

`plot(TukeyHSD(fit))`

Only the difference between `trt1` and `trt2` is significant. We of
course get the same results when using the package `multcomp`.

```
## Tukey HSD with package multcomp:
fit.tukey <- glht(fit, linfct = mcp(group = "Tukey"))
summary(fit.tukey)
```

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.3909
## trt2 - ctrl == 0 0.4940 0.2788 1.772 0.1980
## trt2 - trt1 == 0 0.8650 0.2788 3.103 0.0119 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
```

`confint(fit.tukey)`

```
##
## Simultaneous Confidence Intervals
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## Quantile = 2.4792
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## trt1 - ctrl == 0 -0.3710 -1.0621 0.3201
## trt2 - ctrl == 0 0.4940 -0.1971 1.1851
## trt2 - trt1 == 0 0.8650 0.1739 1.5561
```

`plot(confint(fit.tukey))`

### 4.2.5 Multiple Comparison with a Control (MCC)

In the same spirit, if we want to compare all treatment groups with a
control group, we have a so-called **multiple comparisons with a
control** problem. The corresponding procedure is the **Dunnett**
procedure, which is implemented in the add-on package `multcomp`. By
default, the first level of the factor is taken as the control group.

```
fit.dunnett <- glht(fit, linfct = mcp(group = "Dunnett"))
summary(fit.dunnett)
```

```
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Dunnett Contrasts
##
##
## Fit: aov(formula = weight ~ group, data = PlantGrowth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## trt1 - ctrl == 0 -0.3710 0.2788 -1.331 0.323
## trt2 - ctrl == 0 0.4940 0.2788 1.772 0.153
## (Adjusted p values reported -- single-step method)
```

We get smaller p-values than with the Tukey HSD procedure because we have to correct for fewer tests (there are more comparisons between pairs than there are comparisons with the control treatment).

### 4.2.6 FAQ

*Should I only do individual tests if the global \(F\)-test is
significant?*

No, although this is still suggested in many textbooks. The
above-mentioned procedures have a built-in correction for multiple
testing and do *not* rely on a significant \(F\)-test. Conditioning on
the \(F\)-test leads to a very conservative approach regarding the type
I error rate. In addition, the coverage rate of, e.g., Tukey HSD
confidence intervals can be very low.

*Could it be the case that the \(F\)-test is significant but Tukey HSD
yields only insignificant pairwise tests?*

Yes, because the \(F\)-test can combine evidence across groups, which Tukey HSD cannot (see the example in the appendix).

*Could it be the case that Tukey HSD yields a significant difference
but the global \(F\)-test is not significant?*

Yes: Tukey HSD has larger power for some alternatives because it tries to answer a more precise question (see the example in the appendix).

### 4.2.7 Appendix

#### 4.2.7.1 Tukey vs. \(F\)-test

We use the following two data sets consisting of three groups having two observations each.

```
x <- factor(rep(c("A", "B", "C"), each = 2))

y1 <- c(0.50, 0.62,
        0.46, 0.63,
        0.95, 0.86)
y2 <- c(0.23, 0.34,
        0.45, 0.55,
        0.55, 0.66)
```

`stripchart(y1 ~ x, vertical = TRUE, pch = 1)`

For the first data set, the \(F\)-test is significant, but TukeyHSD is not.

```
fit1 <- aov(y1 ~ x)
summary(fit1)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## x 2 0.1659 0.08295 9.683 0.0491 *
## Residuals 3 0.0257 0.00857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`TukeyHSD(fit1)`

```
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = y1 ~ x)
##
## $x
## diff lwr upr p adj
## B-A -0.015 -0.40177079 0.3717708 0.9856937
## C-A 0.345 -0.04177079 0.7317708 0.0668983
## C-B 0.360 -0.02677079 0.7467708 0.0600951
```

`stripchart(y2 ~ x, vertical = TRUE, pch = 1)`

For the second data set, the \(F\)-test is *not* significant, but TukeyHSD
is.

```
fit2 <- aov(y2 ~ x)
summary(fit2)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## x 2 0.1064 0.05322 9.336 0.0515 .
## Residuals 3 0.0171 0.00570
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`TukeyHSD(fit2)`

```
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = y2 ~ x)
##
## $x
## diff lwr upr p adj
## B-A 0.215 -0.100489685 0.5304897 0.1268772
## C-A 0.320 0.004510315 0.6354897 0.0481916
## C-B 0.105 -0.210489685 0.4204897 0.4479192
```