# Chapter 3 Completely Randomized Designs

In your introductory course to statistics you learned how to compare
two (independent) groups using the two-sample \(t\)-test. If we have
more than two groups, the \(t\)-test is *not* directly applicable
anymore. We will now develop an extension of the two-sample \(t\)-test
to more than two groups. The two-sample \(t\)-test will still prove to
be very useful later if we do pairwise comparisons between treatments.

## 3.1 One-Way Analysis of Variance

Our goal is to compare \(g \geq 2\) treatments. As available resources
we have \(N\) experimental units that we assign *randomly* to the \(g\)
different treatment groups having \(n_i\) observations each, i.e.
we have \(n_1 + \cdots + n_g = N\). This is a so-called **completely
randomized design (CRD)**. The
optimal choice (with respect to power) of \(n_1, \ldots, n_g\) depends
on our research question (see later). If all the treatment groups have
the same number of experimental units we call the design **balanced**.

If we want to randomize 4 different treatments \(A, B, C\) and \(D\)
to a total of 20 experimental units (using a balanced design with 5
experimental units per treatment), we can use the following `R` code:

```
treat.ord <- rep(c("A", "B", "C", "D"), each = 5) ## could also use LETTERS[1:4]
treat.ord
```

```
## [1] "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D" "D" "D"
## [20] "D"
```

```
sample(treat.ord) ## random permutation
```

```
## [1] "B" "D" "C" "B" "B" "B" "A" "A" "C" "C" "D" "A" "B" "C" "D" "C" "D" "D" "A"
## [20] "A"
```

This means that the first experimental unit will get treatment \(B\), the second \(D\) and so on.

### 3.1.1 Cell Means Model

We start by formulating a parametric model for our data. Let \(Y_{ij}\) be
the \(j\)th observation in treatment group \(i\), where \(i=1, \ldots, g\) and
\(j = 1, \ldots, n_i\). In the **cell means model** we allow each treatment group to have its *own* expected value
and we assume that the observations are independent and fluctuate around
this value according to a normal distribution, i.e.,
\[\begin{equation}
Y_{ij} \sim N(\mu_i, \sigma^2), \textrm{ independent}
\tag{3.1}
\end{equation}\]

where

- \(\mu_i\) is the expected value of treatment group \(i\)
- \(\sigma^2\) is the variance of the normal distribution.

As for the (standard) two-sample \(t\)-test, we assume that the variance is *equal* for all
groups. We can re-write Equation (3.1) as
\[\begin{equation}
Y_{ij} = \mu_i + \epsilon_{ij}
\tag{3.2}
\end{equation}\]
with (random) *errors* \(\epsilon_{ij}\) i.i.d. \(\sim N(0,\sigma^2)\). We
simply separated the normal distribution around \(\mu_i\) into a
deterministic part \(\mu_i\) and a stochastic part \(\epsilon_{ij}\)
fluctuating around *zero*. We also say that \(Y\) is the **response**
and the grouping information is a categorical
**predictor**. Hence, this is nothing else than a
regression model with a categorical predictor and normally distributed
errors (for those that are already familiar with linear regression
models).

A categorical predictor is also called a **factor**. We
sometimes distinguish between unordered (or nominal) and ordered (or
ordinal) factors. An example of an unordered factor would be eye color
(e.g., with levels “brown,” “blue,” “green”) and an example of an
ordered factor would be income class (e.g., with levels “low,” “middle,”
“high”).
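
For illustration, here is a minimal sketch (with made-up values, not taken from any data set) of how unordered and ordered factors can be created in `R`:

```
## unordered (nominal) factor
eye.color <- factor(c("brown", "blue", "green", "brown"))
levels(eye.color)

## ordered (ordinal) factor with an explicit ordering of the levels
income <- factor(c("low", "high", "middle", "low"),
                 levels = c("low", "middle", "high"), ordered = TRUE)
income
```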

We can rewrite Equation (3.2) using \[\begin{equation} \mu_i = \mu + \alpha_i \tag{3.3} \end{equation}\] for \(i = 1, \ldots, g\) to obtain \[\begin{equation} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \tag{3.4} \end{equation}\] with \(\epsilon_{ij}\) i.i.d. \(\sim N(0,\sigma^2)\).

We also call \(\alpha_i\) the \(i\)th **treatment effect**. Think of \(\mu\) as a “global mean” and \(\alpha_i\) as a
“deviation from the global mean due to the \(i\)th treatment.” We will
soon see that this interpretation is not always correct, but it is still
helpful. At first sight this looks like writing down the problem in a
more complex form. However, the formulation in
(3.4) will be very useful later if we have more
than one treatment factor and want to “untangle” the influence of
multiple treatment factors on the response.

If we carefully inspect the parameters of models (3.1)
and (3.4) we observe the following: in
(3.1) we have the parameters \(\mu_1, \ldots, \mu_g\) and
\(\sigma^2\) while in (3.4) we have \(\mu, \alpha_1, \ldots, \alpha_g\) and \(\sigma^2\). We (silently) introduced one
additional parameter. In fact, model (3.4) is
not *identifiable* anymore because we have \(g+1\) parameters to model \(g\)
mean values \(\mu_i\). Or in other words: we can “shift around” effects
between \(\mu\) and the \(\alpha_i\)’s without changing the resulting values
of \(\mu_i\); e.g., replacing \(\mu\) by \(\mu + 10\) and each \(\alpha_i\) by \(\alpha_i - 10\) leads
to the same \(\mu_i\)’s. Hence, we need a side-constraint on the
\(\alpha_i\)’s that “removes” that additional parameter. Typical choices
for such a constraint are

Name | Side-constraint | Interpretation of \(\mu\) | `R` |
---|---|---|---|
weighted sum-to-zero | \(\displaystyle \sum_{i=1}^g n_i \alpha_i = 0\) | \(\mu = \frac{1}{N} \displaystyle \sum_{i=1}^g n_i \mu_i\) | |
sum-to-zero | \(\displaystyle \sum_{i=1}^g \alpha_i = 0\) | \(\mu = \frac{1}{g} \displaystyle \sum_{i=1}^g \mu_i\) | `contr.sum` |
reference group | \(\alpha_1 = 0\) | \(\mu = \mu_1\) | `contr.treatment` |

where we have also listed the interpretation of the parameter \(\mu\) and
the corresponding `R` naming convention. For all of the above choices it
holds that \(\mu\) determines some sort of “global level” of the data and
\(\alpha_i\) contains information about the deviation of the group mean
\(\mu_i\) from that “global level.”
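
The following small sketch (with made-up group means and group sizes, not taken from any data set) illustrates how \(\mu\) and the \(\alpha_i\)’s are obtained under each of the three side-constraints:

```
mu.i <- c(10, 12, 17)   ## hypothetical group means mu_1, mu_2, mu_3
n.i  <- c(4, 4, 8)      ## hypothetical group sizes n_1, n_2, n_3

## weighted sum-to-zero: mu is the weighted mean of the group means
mu.w    <- sum(n.i * mu.i) / sum(n.i)
alpha.w <- mu.i - mu.w  ## satisfies sum(n.i * alpha.w) == 0

## sum-to-zero (contr.sum): mu is the unweighted mean of the group means
mu.s    <- mean(mu.i)
alpha.s <- mu.i - mu.s  ## satisfies sum(alpha.s) == 0

## reference group (contr.treatment): mu is the mean of the first group
mu.r    <- mu.i[1]
alpha.r <- mu.i - mu.r  ## satisfies alpha.r[1] == 0
```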

Only \(g - 1\) elements of the treatment effects are allowed to vary
freely. In other words: If we know \(g-1\) of the \(\alpha_i\) values, we
automatically know the remaining \(\alpha_i\). We also say that the
treatment effect has \(g - 1\) **degrees of freedom (df)**.

In `R` the side-constraint is set using the option `contrasts` (see
examples below). The default value is `"contr.treatment"`, which is the
side-constraint “reference group” from above.

### 3.1.2 Parameter Estimation

We estimate the parameters using the **least squares
criterion** which ensures that the model
fits the data well in the sense that the squared deviation from the
observed data \(y_{ij}\) to the model values \(\mu + \alpha_i\) are
minimized, i.e.

\[
\widehat{\mu}, \widehat{\alpha}_i = \argmin_{\mu, \, \alpha_i} \sum_{i=1}^g
\sum_{j =1}^{n_i} \left(y_{ij} - \mu - \alpha_i\right)^2.
\]
In the parametrization of the cell means model this means
\[
\widehat{\mu}_i = \argmin_{\mu_i} \sum_{i=1}^g
\sum_{j =1}^{n_i} \left(y_{ij} - \mu_i \right)^2.
\]
We use the following notation
\[\begin{align*}
y_{i\cdot} & = \sum_{j=1}^{n_i} y_{ij} & \textrm{sum of group $i$}\\
y_{\cdot \cdot} & = \sum_{i=1}^g \sum_{j=1}^{n_i} y_{ij} &
\textrm{sum of all observations}\\
\overline{y}_{i\cdot} & = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} &
\textrm{mean of group $i$}\\
\overline{y}_{\cdot\cdot} & = \frac{1}{N} \sum_{i=1}^g \sum_{j=1}^{n_i} y_{ij} &
\textrm{overall (or total) mean}
\end{align*}\]
As we can independently estimate the values of the different groups, we get
\(\widehat{\mu}_i = \overline{y}_{i \cdot}\), hence we have
\[
\widehat{\mu}_i= \widehat{\mu} + \widehat{\alpha}_i = \overline{y}_{i\cdot}.
\]
Depending on the side-constraint that we use we get different results
for \(\widehat{\alpha}_i\).

The estimate of the error variance \(\widehat{\sigma}^2\) is also called
**mean squared error** \(MS_E\). It is given by
\[
\widehat{\sigma}^2 = MS_E = \frac{1}{N - g} SS_E,
\]
where \(SS_E\) is the **residual** or (**error**)
**sum of squares**
\[
SS_E = \sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \widehat{\mu}_i)^2.
\]
We can also write
\[
MS_E = \frac{1}{N-g} \sum_{i=1}^g (n_i - 1) s_i^2,
\]
where \(s_i^2\) is the empirical variance in treatment group \(i\),
\[
s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \widehat{\mu}_i)^2.
\]

As in the two-sample situation, the denominator \(N - g\) ensures that \(\widehat{\sigma}^2\) is an unbiased estimator for \(\sigma^2\). We also say that the error estimate has \(N - g\) degrees of freedom.
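
As a minimal sketch (using simulated data), the group means and \(MS_E\) can be computed “by hand” and compared with the residual standard error reported by `R`:

```
set.seed(1)  ## simulated data: three groups of five observations each
group <- factor(rep(c("A", "B", "C"), each = 5))
y     <- rnorm(15, mean = rep(c(10, 12, 11), each = 5), sd = 1)

mu.hat <- tapply(y, group, mean)    ## estimated cell means
s2.i   <- tapply(y, group, var)     ## empirical variance per group
n.i    <- tapply(y, group, length)  ## group sizes

## MS_E as the pooled variance with N - g degrees of freedom
MSE <- sum((n.i - 1) * s2.i) / (sum(n.i) - nlevels(group))
MSE
sigma(lm(y ~ group))^2              ## the same value
```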

Let’s have a look at an example using the built-in data set
`PlantGrowth`. See `?PlantGrowth` for details.

```
data(PlantGrowth)
str(PlantGrowth)
```

```
## 'data.frame': 30 obs. of 2 variables:
## $ weight: num 4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
## $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...
```

This means that `group` is a categorical predictor having three levels;
the reference level (the first level) is `ctrl`. We can see the levels
using the function `levels`.

```
levels(PlantGrowth$group)
```

```
## [1] "ctrl" "trt1" "trt2"
```

If we wanted to change the reference level we could do this by using the
function `relevel`, as sketched below.
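
For example, a short sketch (here storing the result in a new, hypothetical variable instead of overwriting the original factor):

```
## make "trt1" the new reference level
group.trt1 <- relevel(PlantGrowth$group, ref = "trt1")
levels(group.trt1)
```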

We can visualize the data by plotting `weight` versus `group` or by
using boxplots per level of `group`.

```
stripchart(weight ~ group, vertical = TRUE, pch = 1, data = PlantGrowth)
```

```
## boxplot per group
boxplot(weight ~ group, data = PlantGrowth)
```

We now fit the model using the `aov` function. We state the model using
the formula notation where the response is listed on the left-hand side
and the (only) predictor on the right-hand side of the tilde sign
“`~`.” The estimated parameters can be extracted using the function
`coef`.

```
fit <- aov(weight ~ group, data = PlantGrowth)
## Have a look at the estimated coefficients
coef(fit)
```

```
## (Intercept) grouptrt1 grouptrt2
## 5.032 -0.371 0.494
```

The column labelled `(Intercept)` contains \(\widehat{\mu} = 5.032\),
which is the estimate of the expected value of the *reference group*
`ctrl` (because by default we use `contr.treatment` and the first level
is the reference group; you can check the settings with the command
`options("contrasts")`). The next column, `grouptrt1`, is
\(\widehat{\alpha}_2 = -0.371\). This means that the difference of group
`trt1` to group `ctrl` is estimated to be \(-0.371\). The last column,
`grouptrt2`, is \(\widehat{\alpha}_3 = 0.494\); it is the difference of
group `trt2` to the group `ctrl`. Hence, for all levels except the
reference level we see differences to the reference group, while the
estimate of the reference level can be found in the column
`(Intercept)`.

We get a clearer picture by using the function `dummy.coef`, which lists
the “full coefficients.”

```
dummy.coef(fit)
```

```
## Full coefficients are
##
## (Intercept): 5.032
## group: ctrl trt1 trt2
## 0.000 -0.371 0.494
```

Interpretation is as in parametrization (3.3). `(Intercept)`
corresponds to \(\widehat{\mu}\), and \(\widehat{\alpha}_1, \widehat{\alpha}_2\)
and \(\widehat{\alpha}_3\) can be found under `ctrl`, `trt1` and `trt2`,
respectively. For example, for \(\widehat{\mu}_2\) we have
\(\widehat{\mu}_2 = 5.032 - 0.371 = 4.661\).

The estimated cell means \(\widehat{\mu}_i\) (also known as the predicted
values per treatment group) can also be obtained by using the function
`predict` on the object `fit` (which contains all information about the
estimated model).

```
predict(fit, newdata = data.frame(group = c("ctrl", "trt1", "trt2")))
```

```
## 1 2 3
## 5.032 4.661 5.526
```

What happens if we change the side-constraint to sum-to-zero? In `R`
this is called `"contr.sum"`. We use the `options` function to change
the encoding (on a global level). Note that `contrasts` takes two
values: the first one is for unordered factors and the second one is for
so-called ordered factors (we leave the second one at its default value
`contr.poly` for the moment).

```
options(contrasts = c("contr.sum", "contr.poly"))
fit2 <- aov(weight ~ group, data = PlantGrowth)
coef(fit2)
```

```
## (Intercept) group1 group2
## 5.073 -0.041 -0.412
```

We get different values because the meaning of the parameters has
changed. If we closely inspect the output, we also see that a slightly
different naming scheme is being used: instead of `factor name` and
`level name` (e.g., `grouptrt1`) we see `factor name` and `level number`
(e.g., `group2`). The column `(Intercept)` is now the global mean of the
data, `group1` is the difference of the first group (`ctrl`) to the
global mean, and `group2` is the difference of the second group (`trt1`)
to the global mean. The difference of `trt2` to the global mean is not
shown. Because of the side-constraint we know it must be
\(-(-0.041 - 0.412) = 0.453\). Again, we get the full picture by calling
`dummy.coef`.

```
dummy.coef(fit2)
```

```
## Full coefficients are
##
## (Intercept): 5.073
## group: ctrl trt1 trt2
## -0.041 -0.412 0.453
```

As before, we can get the predicted value per group with

```
predict(fit2, newdata = data.frame(group = c("ctrl", "trt1", "trt2")))
```

```
## 1 2 3
## 5.032 4.661 5.526
```

Note that the output of `predict` has *not* changed. The predicted
values do *not* depend on the side-constraint that we employ. But the
side-constraint has a *large* impact on the meaning of the parameters of
the model.

### 3.1.3 Tests

With a two-sample \(t\)-test we could test whether two samples shared the same mean. We will now extend this to the \(g > 2\) situation.

Saying that *all* groups share the same mean is equivalent to modeling
the data as
\[
Y_{ij} = \mu + \epsilon_{ij}, \, \epsilon_{ij} \,\, \mathrm{i.i.d.} \, \sim N(0, \sigma^2).
\]
This is the so-called **single mean** model. It
is actually a special case of the cell means model where \(\mu_1 = \ldots = \mu_g\) which is equivalent to \(\alpha_1 = \ldots = \alpha_g = 0\).

Hence, our question boils down to comparing two models, the single mean and the cell means model. More formally, we have the (global) null hypothesis \[ H_0: \mu_1 = \ldots = \mu_g \] versus the alternative \[ H_A: \mu_k \neq \mu_l \textrm{ for at least one pair } k \neq l. \] We call it a global null hypothesis because it affects all parameters at the same time.

We will construct a statistical test by decomposing the variation of the data. Total variation of the response around the overall mean can be decomposed into variation “between groups” and variation “within groups.” More formally, we have \[\begin{equation} \underbrace{\sum_{i=1}^g \sum_{j=1}^{n_i}(y_{ij} - \overline{y}_{\cdot\cdot})^2}_{SS_T} = \underbrace{\sum_{i=1}^g \sum_{j=1}^{n_i}(\overline{y}_{i\cdot} - \overline{y}_{\cdot\cdot})^2}_{SS_\mathrm{Trt}} + \underbrace{\sum_{i=1}^g \sum_{j=1}^{n_i}(y_{ij} - \overline{y}_{i\cdot})^2}_{SS_E}, \tag{3.5} \end{equation}\] where \[\begin{align*} SS_T & = \textrm{total sum of squares}\\ SS_\mathrm{Trt} & = \textrm{treatment sum of squares (between groups)}\\ SS_E & = \textrm{error sum of squares (within groups)} \end{align*}\] Hence, we have \[ SS_T = SS_\mathrm{Trt} + SS_E. \] As can be seen from the decomposition in (3.5), \(SS_\mathrm{Trt}\) can also be interpreted as the reduction in residual sum of squares when comparing the cell means with the single mean model (this will be a useful interpretation later).
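
A small sketch of how the decomposition in (3.5) can be verified numerically for the `PlantGrowth` data:

```
y     <- PlantGrowth$weight
group <- PlantGrowth$group

SS.T   <- sum((y - mean(y))^2)              ## total sum of squares
SS.Trt <- sum((ave(y, group) - mean(y))^2)  ## between groups
SS.E   <- sum((y - ave(y, group))^2)        ## within groups
c(SS.T, SS.Trt + SS.E)                      ## both entries are equal
```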

This information can be summarized in a so-called **ANOVA table**,
where ANOVA stands for analysis of variance:

Source | df | Sum of squares (\(SS\)) | Mean squares (\(MS\)) | \(F\)-ratio |
---|---|---|---|---|
Treatment | \(g - 1\) | \(SS_\mathrm{Trt}\) | \(MS_\mathrm{Trt} = \frac{SS_\mathrm{Trt}}{g-1}\) | \(\frac{MS_\mathrm{Trt}}{MS_E}\) |
Error | \(N - g\) | \(SS_E\) | \(MS_E = \frac{SS_E}{N-g}\) | |

The mean squares are sums of squares normalized by the
corresponding degrees of freedom. This is a so-called **one-way analysis
of variance** (or short: **one-way ANOVA**) because there is only *one*
factor involved. Another rule of thumb for getting the degrees of
freedom for the error is as follows: Total sum of squares has \(N - 1\)
degrees of freedom (we have \(N\) observations that fluctuate around the
global mean), \(g - 1\) degrees of freedom are “used” for the treatment
effect (\(g\) groups minus one side-constraint). Hence, the remaining part
(the error) has \(N - 1 - (g - 1) = N - g\) degrees of freedom.

If all the groups share the same (theoretical) mean, we expect the
treatment sum of squares to be small: Just due to the random nature of
the data we expect small differences between the different (empirical)
group means. The idea is now to compare the variation *between* groups
with the variation *within* groups. If the variation between groups is
substantially larger than the variation within groups we have evidence
against the null hypothesis.

In fact, it can be shown that under \(H_0\) it holds that \[ F = \frac{MS_{\mathrm{Trt}}}{MS_E} \sim F_{g-1, \, N-g} \] where we denote by \(F_{n, \, m}\) the so-called \(F\)-distribution with \(n\) and \(m\) degrees of freedom. The \(F\)-distribution has two degrees of freedom parameters, one from the numerator (here: \(g - 1\)) and one from the denominator (here: \(N - g\)). See Figure 3.1 for the density of the \(F\)-distribution for some combinations of \(n\) and \(m\). The \(F_{1, \, m}\)-distribution is a special case that we already know, it is the square of a \(t_m\)-distribution (a \(t\)-distribution with \(m\) degrees of freedom).

As with any other statistical test, we reject the null hypothesis if the observed value of the \(F\)-ratio (our test statistics) lies in an “extreme” region of the corresponding \(F\)-distribution. This is a one-sided test, which means that we reject \(H_0\) in favor of \(H_A\) if \(F\) is larger than the 95% quantile (if we use a 5% significance level). We also use the notation \(F_{n, \, m, \, \alpha}\) for the \((\alpha \times 100)\)% quantile.
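
A short sketch of how the observed \(F\)-ratio, its p-value and the 95% quantile of the corresponding \(F\)-distribution can be computed directly from the model `fit` from above:

```
tab   <- anova(fit)               ## ANOVA table of the fitted model
F.obs <- tab["group", "F value"]  ## observed F-ratio

g <- nlevels(PlantGrowth$group)   ## number of groups
N <- nrow(PlantGrowth)            ## total number of observations
pf(F.obs, df1 = g - 1, df2 = N - g, lower.tail = FALSE)  ## p-value
qf(0.95, df1 = g - 1, df2 = N - g)  ## start of the rejection region
```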

The \(F\)-test is also called an **omnibus test**
(Latin for “for all”) as it compares *all* group means simultaneously.

To better understand the \(F\)-distribution, we have a look at the
behaviour of the 95%-quantile, as this is where the rejection region
starts. In Figure 3.2 we see that it is mostly the
denominator degrees of freedom that has a large impact on the quantile.
Increasing the denominator degrees of freedom will *decrease* the
corresponding quantile. A small 95%-quantile is a nice property because
we start rejecting “early” (large power!). Hence, we would like the
denominator degrees of freedom to be large. This can be achieved through
large sample size \(N\). However, the curve flattens out at about 10. A
classical rule of thumb is therefore that the denominator degrees of
freedom should be at least 10, otherwise power is typically too low.
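
A small sketch reproducing this behaviour, plotting the 95% quantile of the \(F_{n, \, m}\)-distribution against the denominator degrees of freedom \(m\) (here for \(n = 4\), an arbitrary choice):

```
m <- 1:50  ## denominator degrees of freedom
plot(m, qf(0.95, df1 = 4, df2 = m), type = "l",
     xlab = "denominator df", ylab = "95% quantile of F(4, m)")
abline(v = 10, lty = 2)  ## rule of thumb: at least 10 denominator df
```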

In `R` we can use the `summary` function to get the ANOVA table and the
p-value.

```
summary(fit)
```

```
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 3.766 1.8832 4.846 0.0159 *
## Residuals 27 10.492 0.3886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In the column labelled `Pr(>F)` we see the corresponding p-value:
0.0159. Hence, we reject the null hypothesis at the 5% significance
level.

Alternatively, we can perform the global \(F\)-test with the command
`drop1`, which tests whether a single “term” (here: the factor `group`)
can be dropped from the model.

```
drop1(fit, test = "F")
```

```
## Single term deletions
##
## Model:
## weight ~ group
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 10.492 -25.517
## group 2 3.7663 14.258 -20.316 4.8461 0.01591 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We get of course the same p-value as above, but `drop1` will be a useful
function later.

As the global test can also be interpreted as a test for comparing two
different models, namely the cell means and the single mean model, we
have yet another approach in `R`: we can use the function `anova` to
compare the two models.

```
## Fit single mean model:
fit.single <- aov(weight ~ 1, data = PlantGrowth) ## 1 means global mean (intercept)
## Compare with cell means model:
anova(fit.single, fit)
```

```
## Analysis of Variance Table
##
## Model 1: weight ~ 1
## Model 2: weight ~ group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 14.258
## 2 27 10.492 2 3.7663 4.8461 0.01591 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Again, we get the same p-value. We would also get the same p-value if we
used another side-constraint (e.g., if we used the `fit2` object).

To perform statistical inference for the *individual* \(\alpha_i\)’s we
can use the commands `summary.lm` for tests and `confint` for confidence
intervals.

```
summary.lm(fit)
```

```
##
## Call:
## aov(formula = weight ~ group, data = PlantGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0710 -0.4180 -0.0060 0.2627 1.3690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0320 0.1971 25.527 <2e-16 ***
## grouptrt1 -0.3710 0.2788 -1.331 0.1944
## grouptrt2 0.4940 0.2788 1.772 0.0877 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6234 on 27 degrees of freedom
## Multiple R-squared: 0.2641, Adjusted R-squared: 0.2096
## F-statistic: 4.846 on 2 and 27 DF, p-value: 0.01591
```

```
confint(fit)
```

```
## 2.5 % 97.5 %
## (Intercept) 4.62752600 5.4364740
## grouptrt1 -0.94301261 0.2010126
## grouptrt2 -0.07801261 1.0660126
```

Interpretation of the output *highly* depends on the side-constraint
that is being used. E.g., the confidence interval
\([-0.943, 0.201]\)
from the above output is a confidence interval for the difference
between `trt1` and `ctrl` (because we used `contr.treatment`!).

### 3.1.4 Checking Model Assumptions (Residual Analysis)

Statistical inference (p-values, confidence intervals, …) is only valid if the model assumptions are fulfilled. So far, this means

- are the errors independent?
- are the errors normally distributed?
- is the error variance constant?
- do the errors have mean zero?

The first assumption is most crucial (but also most difficult to check). If the independence assumption is violated, statistical inference can be very inaccurate. In the ANOVA setting, the last assumption is typically not as important compared to a regression setting, as we are typically fitting “large” models.

We *cannot* observe the errors \(\epsilon_{ij}\) but we have the residuals
\(r_{ij}\) as “estimates,”
\[
r_{ij} = y_{ij} - \widehat{\mu}_{i}.
\]
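
A short sketch showing that, for the `PlantGrowth` model `fit` from above, the residuals are simply the observations minus the corresponding fitted cell means:

```
r <- residuals(fit)                               ## residuals r_ij
max(abs(r - (PlantGrowth$weight - fitted(fit))))  ## essentially zero
```
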
We now introduce different plots to check these assumptions.

**QQ-Plot**

In a **QQ-plot** we plot the empirical quantiles (“what
we see in the data”) vs. the theoretical quantiles (“what we expect
from the model”). The plot should show a more or less straight line
if the distributional assumption is correct. By default, a standard
normal distribution is the theoretical “reference distribution.”

```
plot(fit, which = 2)
```

If the QQ-plot suggests non-normality, we can try to use a transformation of the response to accommodate this problem. We can for example use the logarithm or the square root if the residuals are skewed to the right (or any power less than one). If the residuals are skewed to the left we could try a power greater than one.

Note that whenever we transform the response this comes at the price of
a new interpretation of the parameters! Care has to be taken if one
wants to interpret statistical inference (e.g., confidence intervals)
on the original scale, as many aspects do *not* easily “transform
back!”

**Tukey-Anscombe Plot (TA-Plot)**

The **Tukey-Anscombe plot** plots
the residuals \(r_{ij}\) vs. the fitted values \(\widehat{\mu}_{i}\).
It allows us to check whether the residuals have constant variance
and whether the residuals have mean zero (i.e. they don’t show any
deterministic pattern).

```
plot(fit, which = 1)
```

The red line is a smoother (a local average) of the residuals. We could
get rid of it by using the function call
`plot(fit, which = 1, add.smooth = FALSE)`. For the one-way ANOVA
situation we could also read off the same information from the plot of
the data itself. In addition, as we model one expected value per group,
the model is “saturated” and the residuals automatically have mean zero
(per group).

Quite often we observe a relationship, e.g. a funnel shape, between the variance of the residuals (spread in y-direction) and the mean (on the x-axis). Again, we can try to transform the response to fix this problem, e.g., by transforming the response with the logarithm.

**Index Plot**

If the data has some serial structure (i.e., if observations were
recorded in a certain time order), we typically want to check whether
residuals close in time are more similar than residuals far apart, as
this would be a violation of the independence assumption. We can do so
by using a so-called **index plot** where we plot the
residuals against time. For positively dependent residuals we would
see time periods where most residuals have the same sign, while for
negatively dependent residuals, the residuals would “jump” too often
from positive to negative compared to independent residuals. Examples
can be found in Figure 3.3.
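
A minimal sketch of such an index plot for the `PlantGrowth` model `fit`, simply using the row order of the data as a stand-in for the (unknown) recording order:

```
plot(residuals(fit), type = "b",
     xlab = "index (time order)", ylab = "residual")
abline(h = 0, lty = 2)
```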

### 3.1.5 Transformations Affect Interpretation

Whenever we transform the response we implicitly also change the interpretation of the model parameters. Therefore, while it is conceptually attractive to model the problem on an appropriate scale of the response, this typically has the side effect of making interpretation (much) more difficult.

For example, if we use the logarithm,
\[
\log(Y_{ij}) = \mu + \alpha_i + \epsilon_{ij}
\]
all the \(\alpha_i\)’s (and their estimates) have to be interpreted on the
log-scale. Say we use `contr.treatment` where the first group is the
reference group and we have \(\widehat{\alpha}_2 = 1.5\). This means: on
the log-scale we estimate that the average value of group 2 is \(1.5\)
larger than the average value of group 1 (additive shift). What about
the effect on the *original* scale? On the log-scale we had
\(\mathbb{E}[\log(Y_{ij})] = \mu + \alpha_{i}\). What does this tell us
about \(\mathbb{E}[Y_{ij}]\), the expected value on the *original* scale?
We don’t know as the expected value does (in general) *not* directly
follow the transformation, that is
\[
\mathbb{E}[Y_{ij}] \neq e^{\mu + \alpha_{i}}.
\]
However, we can make a statement about the *median*. On the log-scale
the median is equal to the mean (because we have a symmetric
distribution around \(\mu + \alpha_i\)). Hence,
\[
\mathrm{median}(\log(Y_{ij})) = \mu + \alpha_i.
\]
In contrast to the mean, any quantile directly transforms with a
strictly monotone increasing function. As the median is nothing else
than the 50%-quantile, we have
\[
\mathrm{median}(Y_{ij}) = e^{\mu + \alpha_i}.
\]
Similarly, for the ratio
\[
\frac{\mathrm{median}(Y_{2j})}{\mathrm{median}(Y_{1j})} = \frac{e^{\mu + \alpha_2}}{e^{\mu}} = e^{\alpha_2}.
\]
Hence, we can make a statement that on the original scale the median of
group 2 is \(e^{\widehat{\alpha}_2} = e^{1.5} = 4.48\) times as large as the
median of group 1. This means that additive effects on the log-scale become
*multiplicative* effects on the original scale. Unfortunately, the
statement is only about the median and *not* the mean on the original
scale.

If we also consider a confidence interval for \(\alpha_2\), say \([1.2, 1.8]\), the transformed version \([e^{1.2}, e^{1.8}]\) is a confidence
interval for \(e^{\alpha_2}\) which is the *ratio* of medians on the
original scale, that is for
\[
\frac{\mathrm{median}(Y_{2j})}{\mathrm{median}(Y_{1j})}.
\]
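
As an illustration with the `PlantGrowth` data (the log-transformation here is only a sketch to demonstrate the back-transformation, not a claim that it is needed for these data):

```
## force the reference-group parametrization for this fit only
fit.log <- aov(log(weight) ~ group, data = PlantGrowth,
               contrasts = list(group = "contr.treatment"))
exp(confint(fit.log))  ## rows grouptrt1 and grouptrt2: confidence intervals
                       ## for the ratios of medians of trt1 / trt2 to ctrl
```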

## 3.2 Appendix: Connection to Regression

We can also write equation (3.4) in matrix
notation
\[
Y = X \beta + E,
\]
where \(Y \in \mathbb{R}^N\) contains all the response values, \(X\) is
an \(N \times g\) matrix, the so-called **design matrix** ,
\(\beta \in \mathbb{R}^g\) contains all parameters and
\(E \in \mathbb{R}^N\) are the error terms. A *row* of the design matrix
corresponds to an individual *observation*, while a *column* corresponds
to a *parameter*. This is a typical setup for any regression problem.

We use a subset of the `PlantGrowth` data set (only the first two
observations per group) for illustration.

```
PlantGrowth[c(1, 2, 11, 12, 21, 22), ]
```

```
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 11 4.81 trt1
## 12 4.17 trt1
## 21 6.31 trt2
## 22 5.12 trt2
```

Hence, for the response vector \(Y\) we have
\[
Y = \left(
\begin{array}{cc}
4.17 \\
5.58 \\
4.81 \\
4.17 \\
6.31 \\
5.12
\end{array}
\right).
\]
What do the design matrix \(X\) and the parameter vector \(\beta\) look
like? This depends on the side-constraint on the \(\alpha_i\)’s. If we use
`contr.treatment` we get

\[\begin{equation*}
X = \left(
\begin{array}{ccc}
1 & 0 & 0 \\
1 & 0 & 0 \\
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
\end{array}
\right), \qquad \beta = \left(
\begin{array}{cc}
\mu \\
\alpha_2 \\
\alpha_3
\end{array}
\right)
\end{equation*}\]
because the group `ctrl` is the reference level (hence \(\alpha_1 = 0\)).
For `contr.sum` we have

\[\begin{equation*}
X = \left(
\begin{array}{rrr}
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & -1 & -1 \\
1 & -1 & -1 \\
\end{array}
\right), \qquad \beta = \left(
\begin{array}{cc}
\mu \\
\alpha_1 \\
\alpha_2
\end{array}
\right)
\end{equation*}\]
because \(\alpha_3 = -(\alpha_1 + \alpha_2)\).

Hence, what we do is nothing else than fitting a linear regression model
with a categorical predictor. The categorical predictor is being
represented by a *set* of dummy variables.
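
We can inspect these design matrices directly in `R` with `model.matrix`; a short sketch for the subset shown above:

```
plants.sub <- PlantGrowth[c(1, 2, 11, 12, 21, 22), ]

## design matrix under the reference-group constraint
model.matrix(~ group, data = plants.sub,
             contrasts.arg = list(group = "contr.treatment"))

## design matrix under the sum-to-zero constraint
model.matrix(~ group, data = plants.sub,
             contrasts.arg = list(group = "contr.sum"))
```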