Chapter 3 Completely Randomized Designs

In your introductory course to statistics you learned how to compare two (independent) groups using the two-sample \(t\)-test. If we have more than two groups, the \(t\)-test is not directly applicable anymore. We will now develop an extension of the two-sample \(t\)-test to more than two groups. The two-sample \(t\)-test will still prove to be very useful later if we do pairwise comparisons between treatments.

3.1 One-Way Analysis of Variance

Our goal is to compare \(g \geq 2\) treatments. As available resources we have \(N\) experimental units that we assign randomly to the \(g\) different treatment groups having \(n_i\) observations each, i.e. we have \(n_1 + \cdots + n_g = N\). This is a so-called completely randomized design (CRD). The optimal choice (with respect to power) of \(n_1, \ldots, n_g\) depends on our research question (see later). If all the treatment groups have the same number of experimental units we call the design balanced.

If we want to randomize 4 different treatments \(A, B, C\) and \(D\) to a total of 20 experimental units (using a balanced design with 5 experimental units per treatment), we can use the following R code

treat.ord <- rep(c("A", "B", "C", "D"), each = 5) ## could also use LETTERS[1:4]
##  [1] "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D" "D" "D"
## [20] "D"
sample(treat.ord) ## random permutation
##  [1] "B" "D" "C" "B" "B" "B" "A" "A" "C" "C" "D" "A" "B" "C" "D" "C" "D" "D" "A"
## [20] "A"

This means that the first experimental unit will get treatment \(A\), the second \(B\) and so on.

3.1.1 Cell Means Model

We start by formulating a parametric model for our data. Let \(Y_{ij}\) be the \(j\)th observation in treatment group \(i\), where \(i=1, \ldots, g\) and \(j = 1, \ldots, n_i\). In the cell means model we allow each treatment group to have its own expected value and we assume that the observations are independent and fluctuate around this value according to a normal distribution, i.e., \[\begin{equation} Y_{ij} \sim N(\mu_i, \sigma^2), \textrm{ independent} \tag{3.1} \end{equation}\]


  • \(\mu_i\) is the expected value of treatment group \(i\)
  • \(\sigma^2\) is the variance of the normal distribution.

As for the (standard) two-sample \(t\)-test, the variance is equal for all groups. We can re-write Equation (3.1) as \[\begin{equation} Y_{ij} = \mu_i + \epsilon_{ij} \tag{3.2} \end{equation}\] with (random) errors \(\epsilon_{ij}\) i.i.d. \(\sim N(0,\sigma^2)\). We simply separated the normal distribution around \(\mu_i\) into a deterministic part \(\mu_i\) and a stochastic part \(\epsilon_{ij}\) fluctuating around zero. We also say that \(Y\) is the response and the grouping information is a categorical predictor. Hence, this is nothing else than a regression model with a categorical predictor and normally distributed errors (for those that are already familiar with linear regression models).

A categorical predictor is also called a factor. We sometimes distinguish between unordered (or nominal) and ordered (or ordinal) factors. An example of an unordered factor would be eye color (e.g., with levels “brown”, “blue”, “green”) and an example of an ordered factor would be income class (e.g., with levels “low”, “middle”, “high”).

We can rewrite Equation (3.2) using \[\begin{equation} \mu_i = \mu + \alpha_i \tag{3.3} \end{equation}\] for \(i = 1, \ldots, g\) to obtain \[\begin{equation} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \tag{3.4} \end{equation}\] with \(\epsilon_{ij}\) i.i.d. \(\sim N(0,\sigma^2)\).

We also call \(\alpha_i\) the \(i\)th treatment effect. Think of \(\mu\) as a “global mean” and \(\alpha_i\) as a “deviation from the global mean due to the \(i\)th treatment”. We will soon see that this interpretation is not always correct, but it is still helpful. At first sight this looks like writing down the problem in a more complex form. However, the formulation in (3.4) will be very useful later if we have more than one treatment factor and want to “untangle” the influence of multiple treatment factors on the response.

If we carefully inspect the parameters of models (3.1) and (3.4) we observe the following: in (3.1) we have the parameters \(\mu_1, \ldots, \mu_g\) and \(\sigma^2\) while in (3.4) we have \(\mu, \alpha_1, \ldots, \alpha_g\) and \(\sigma^2\). We (silently) introduced one additional parameter. In fact, model (3.4) is not identifiable anymore because we have \(g+1\) parameters to model \(g\) mean values \(\mu_i\). Or in other words: we can “shift around” effects between \(\mu\) and the \(\alpha_i\)’s without changing the resulting values of \(\mu_i\), e.g., we can adjust \(\mu + 10\) and \(\alpha_i - 10\) leading to the same \(\mu_i\)’s. Hence, we need a side-constraint on the \(\alpha_i\)’s that “removes” that additional parameter. Typical choices for such a constraint are

Name Side-constraint Interpretation of \(\mu\) R
weighted sum-to-zero \(\displaystyle \sum_{i=1}^g n_i \alpha_i = 0\) \(\mu = \frac{1}{N} \displaystyle \sum_{i=1}^g n_i \mu_i\)
sum-to-zero \(\displaystyle \sum_{i=1}^g \alpha_i = 0\) \(\mu = \frac{1}{g} \displaystyle \sum_{i=1}^g \mu_i\) contr.sum
reference group \(\alpha_1 = 0\) \(\mu = \mu_1\) contr.treatment

where we have also listed the interpretation of the parameter \(\mu\) and the R naming convention. For all of the above choices it holds that \(\mu\) determines some sort of “global level” of the data and \(\alpha_i\) contains information about differences between the groups mean \(\mu_i\) from that “global level”.

Only \(g - 1\) elements of the treatment effects are allowed to vary freely. In other words: If we know \(g-1\) of the \(\alpha_i\) values, we automatically know the remaining \(\alpha_i\). We also say that the treatment effect has \(g - 1\) degrees of freedom (df).

In R the side-constraint is set using the option contrasts (see examples below). The default value is "contr.treatment" which is the side-constraint “reference group” from above.

3.1.2 Parameter Estimation

We estimate the parameters using the least squares criterion which ensures that the model fits the data well in the sense that the squared deviation from the observed data \(y_{ij}\) to the model values \(\mu + \alpha_i\) are minimized, i.e.
\[ \widehat{\mu}, \widehat{\alpha}_i = \argmin_{\mu, \, \alpha_i} \sum_{i=1}^g \sum_{j =1}^{n_i} \left(y_{ij} - \mu - \alpha_i\right)^2. \] In the parametrization of the cell means model this means \[ \widehat{\mu}_i = \argmin_{\mu_i} \sum_{i=1}^g \sum_{j =1}^{n_i} \left(y_{ij} - \mu_i \right)^2. \] We use the following notation \[\begin{align*} y_{i\cdot} & = \sum_{j=1}^{n_i} y_{ij} & \textrm{sum of group $i$}\\ y_{\cdot \cdot} & = \sum_{i=1}^g \sum_{j=1}^{n_i} y_{ij} & \textrm{sum of all observations}\\ \overline{y}_{i\cdot} & = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} & \textrm{mean of group $i$}\\ \overline{y}_{\cdot\cdot} & = \frac{1}{N} \sum_{i=1}^g \sum_{j=1}^{n_i} y_{ij} & \textrm{overall (or total) mean} \end{align*}\] As we can independently estimate the values of the different groups, we get \(\widehat{\mu}_i = \overline{y}_{i \cdot}\), hence we have \[ \widehat{\mu}_i= \widehat{\mu} + \widehat{\alpha}_i = \overline{y}_{i\cdot}. \] Depending on the side-constraint that we use we get different results for \(\widehat{\alpha}_i\).

The estimate of the error variance \(\widehat{\sigma}^2\) is also called mean squared error \(MS_E\). It is given by \[ \widehat{\sigma}^2 = MS_E = \frac{1}{N - g} SS_E, \] where \(SS_E\) is the residual or (error) sum of squares \[ SS_E = \sum_{i=1}^g \sum_{j=1}^{n_i} (y_{ij} - \widehat{\mu}_i)^2. \] We can also write \[ MS_E = \frac{1}{N-g} \sum_{i=1}^g (n_i - 1) s_i^2, \] where \(s_i^2\) is the empirical variance in treatment group \(i\), \[ s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \widehat{\mu}_i)^2. \]

As in the two-sample situation, the denominator \(N - g\) ensures that \(\widehat{\sigma}^2\) is an unbiased estimator for \(\sigma^2\). We also say that the error estimate has \(N - g\) degrees of freedom.

Let’s have a look at an example using the built-in data-set PlantGrowth. See ?PlantGrowth for details.

## 'data.frame':    30 obs. of  2 variables:
##  $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
##  $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

This means that group is a categorical predictor having three levels, the reference level (the first level) is ctrl. We can see the levels using the function levels.

## [1] "ctrl" "trt1" "trt2"

If we wanted to change the reference level we could do this by using the function relevel.

We can visualize the data by plotting weight versus group or by using boxplots per level of group.

stripchart(weight ~ group, vertical = TRUE, pch = 1, data = PlantGrowth)

## boxplot per group
boxplot(weight ~ group, data = PlantGrowth)

We now fit the model using the aov function. We state the model using the formula notation where the response is listed on the left-hand side and the (only) predictor is on the right hand side of the tilde sign “~”. The estimated parameters can be extracted using the function coef.

fit <- aov(weight ~ group, data = PlantGrowth)
## Have a look at the estimated coefficients
## (Intercept)   grouptrt1   grouptrt2 
##       5.032      -0.371       0.494

The column labelled with (Intercept) contains \(\widehat{\mu} = 5.032\) which is the estimated value of the expected value of the reference group ctrl (because by default we use contr.treatment and the first level is the reference group; you can check the settings with the command options("contrasts")). The next column grouptrt1 is \(\widehat{\alpha}_2 = -0.371\). This means that the difference of group trt1 to group ctrl is estimated to be \(-0.371\). The last column grouptrt2 is \(\widehat{\alpha}_3 = 0.494\). It is the difference of group trt2 to the group ctrl. Hence, for all levels except the reference level we see differences to the reference group while the estimate of the reference level can be found in the column (Intercept).

We get a clearer picture by using the function dummy.coef which lists the “full coefficients”.

## Full coefficients are 
## (Intercept):     5.032              
## group:            ctrl   trt1   trt2
##                  0.000 -0.371  0.494

Interpretation is as in parametrization (3.3). (Intercept) corresponds to \(\widehat{\mu}\) and \(\widehat{\alpha}_1, \widehat{\alpha}_2\) and \(\widehat{\alpha}_3\) can be found under ctrl, trt1 and trt2, respectively. For example, for \(\widehat{\mu}_2\) we have \(\widehat{\mu}_2 = 5.032 - 0.371 = 4.661\).

The estimated cell means \(\widehat{\mu}_i\) (also known as the predicted value per treatment group) can also be obtained by using the function predict on the object fit (which contains all information about the estimated model).

predict(fit, newdata = data.frame(group = c("ctrl", "trt1", "trt2")))
##     1     2     3 
## 5.032 4.661 5.526

What happens if we change the side-constraint to sum-to-zero? In R this is called "contr.sum". We use the options function to change the encoding (on a global level). Note that contrasts takes two values. The first one is for unordered factors and the second one is for so-called ordered factors (we leave the second argument at its default value contr.poly for the moment).

options(contrasts = c("contr.sum", "contr.poly")) 
fit2 <- aov(weight ~ group, data = PlantGrowth)
## (Intercept)      group1      group2 
##       5.073      -0.041      -0.412

We get different values because the meaning of the parameters has changed. If we closely inspect the output, we also see that a slightly different naming scheme is being used. Instead of factor name and level name (e.g., grouptrt1) we see factor name and level number (e.g., group2). The column (Intercept) is now the global mean of the data, group1 is the difference of the first group (ctrl) to the global mean, group2 is the difference of the second group (trt1) to the global mean. The difference of trt2 to the global mean is not shown. Because of the side-constraint we know it must be \(-(-0.041-0.412) = 0.453\). Again, we get the full picture by calling dummy.coef.

## Full coefficients are 
## (Intercept):     5.073              
## group:            ctrl   trt1   trt2
##                 -0.041 -0.412  0.453

As before we can get the predicted value per group with

predict(fit2, newdata = data.frame(group = c("ctrl", "trt1", "trt2")))
##     1     2     3 
## 5.032 4.661 5.526

Note that the output of predict has not changed. The predicted values do not depend on the side-constraint that we employ. But the side-constraint has a large impact on the meaning of the parameters of the model.

3.1.3 Tests

With a two-sample \(t\)-test we could test whether two samples shared the same mean. We will now extend this to the \(g > 2\) situation.

Saying that all groups share the same mean is equivalent to model the data as \[ Y_{ij} = \mu + \epsilon_{ij}, \, \epsilon_{ij} \,\, \mathrm{i.i.d.} \, \sim N(0, \sigma^2). \] This is the so-called single mean model. It is actually a special case of the cell means model where \(\mu_1 = \ldots = \mu_g\) which is equivalent to \(\alpha_1 = \ldots = \alpha_g = 0\).

Hence, our question boils down to comparing two models, the single mean and the cell means model. More formally, we have the (global) null hypothesis \[ H_0: \mu_1 = \ldots = \mu_g \] versus the alternative \[ H_A: \mu_k \neq \mu_l \textrm{ for at least one pair } k \neq l. \] We call it a global null hypothesis because it affects all parameters at the same time.

We will construct a statistical test by decomposing the variation of the data. Total variation of the response around the overall mean can be decomposed into variation “between groups” and variation “within groups”. More formally, we have \[\begin{equation} \underbrace{\sum_{i=1}^g \sum_{j=1}^{n_i}(y_{ij} - \overline{y}_{\cdot\cdot})^2}_{SS_T} = \underbrace{\sum_{i=1}^g \sum_{j=1}^{n_i}(\overline{y}_{i\cdot} - \overline{y}_{\cdot\cdot})^2}_{SS_\mathrm{Trt}} + \underbrace{\sum_{i=1}^g \sum_{j=1}^{n_i}(y_{ij} - \overline{y}_{i\cdot})^2}_{SS_E}, \tag{3.5} \end{equation}\] where \[\begin{align*} SS_T & = \textrm{total sum of squares}\\ SS_\mathrm{Trt} & = \textrm{treatment sum of squares (between groups)}\\ SS_E & = \textrm{error sum of squares (within groups)} \end{align*}\] Hence, we have \[ SS_T = SS_\mathrm{Trt} + SS_E. \] As can be seen from the decomposition in (3.5), \(SS_\mathrm{Trt}\) can also be interpreted as the reduction in residual sum of squares when comparing the cell means with the single mean model (this will be a useful interpretation later).

This information can be summarized in a so-called ANOVA table, where ANOVA stands for analysis of variance:

Source df Sum of squares (\(SS\)) Mean squares (\(MS\)) \(F\)-ratio
Treatment \(g - 1\) \(SS_\mathrm{Trt}\) \(MS_\mathrm{Trt} = \frac{SS_\mathrm{Trt}}{g-1}\) \(\frac{MS_\mathrm{Trt}}{MS_E}\)
Error \(N - g\) \(SS_E\) \(MS_E = \frac{SS_E}{N-g}\)

The mean squares are sum of squares that are normalized with the corresponding degrees of freedom. This is a so-called one-way analysis of variance (or short: one-way ANOVA) because there is only one factor involved. Another rule of thumb for getting the degrees of freedom for the error is as follows: Total sum of squares has \(N - 1\) degrees of freedom (we have \(N\) observations that fluctuate around the global mean), \(g - 1\) degrees of freedom are “used” for the treatment effect (\(g\) groups minus one side-constraint). Hence, the remaining part (the error) has \(N - 1 - (g - 1) = N - g\) degrees of freedom.

If all the groups share the same (theoretical) mean, we expect the treatment sum of squares to be small: Just due to the random nature of the data we expect small differences between the different (empirical) group means. The idea is now to compare the variation between groups with the variation within groups. If the variation between groups is substantially larger than the variation within groups we have evidence against the null hypothesis.

In fact, it can be shown that under \(H_0\) it holds that \[ F = \frac{MS_{\mathrm{Trt}}}{MS_E} \sim F_{g-1, \, N-g} \] where we denote by \(F_{n, \, m}\) the so-called \(F\)-distribution with \(n\) and \(m\) degrees of freedom. The \(F\)-distribution has two degrees of freedom parameters, one from the numerator (here: \(g - 1\)) and one from the denominator (here: \(N - g\)). See Figure 3.1 for the density of the \(F\)-distribution for some combinations of \(n\) and \(m\). The \(F_{1, \, m}\)-distribution is a special case that we already know, it is the square of a \(t_m\)-distribution (a \(t\)-distribution with \(m\) degrees of freedom).

Densities of $F$-distributions.

Figure 3.1: Densities of \(F\)-distributions.

As with any other statistical test, we reject the null hypothesis if the observed value of the \(F\)-ratio (our test statistics) lies in an “extreme” region of the corresponding \(F\)-distribution. This is a one-sided test, which means that we reject \(H_0\) in favor of \(H_A\) if \(F\) is larger than the 95% quantile (if we use a 5% significance level). We also use the notation \(F_{n, \, m, \, \alpha}\) for the \((\alpha \times 100)\)% quantile.

The \(F\)-test is also called an omnibus test (Latin for “for all”) as it compares all group means simultaneously.

To better understand the \(F\)-distribution, we have a look at the behaviour of the 95%-quantile, as this is where the rejection region starts. In Figure 3.2 we see that it is mostly the denominator degrees of freedom that has a large impact on the quantile. Increasing the denominator degrees of freedom will decrease the corresponding quantile. A small 95%-quantile is a nice property because we start rejecting “early” (large power!). Hence, we would like the denominator degrees of freedom to be large. This can be achieved through large sample size \(N\). However, the curve flattens out at about 10. A classical rule of thumb is therefore that the denominator degrees of freedom should be at least 10, otherwise power is typically too low.

95%-quantiles of $F$-distributions. The quantile mainly depends on the denominator degrees of freedom with the curves flattening out at about 10.

Figure 3.2: 95%-quantiles of \(F\)-distributions. The quantile mainly depends on the denominator degrees of freedom with the curves flattening out at about 10.

In R we can use the summary function to get the ANOVA table and the p-value.

##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        2  3.766  1.8832   4.846 0.0159 *
## Residuals   27 10.492  0.3886                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the column labelled with Pr(>F) we see the corresponding p-value: 0.0159. Hence, we reject the null hypothesis on a 5% significance level.

Alternatively, we can also use the command drop1 which performs a test whether a single “term” (here: a factor) can be dropped from the model to perform the global \(F\)-test.

drop1(fit, test = "F")
## Single term deletions
## Model:
## weight ~ group
##        Df Sum of Sq    RSS     AIC F value  Pr(>F)  
## <none>              10.492 -25.517                  
## group   2    3.7663 14.258 -20.316  4.8461 0.01591 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We get of course the same p-value as above, but drop1 will be a useful function later.

As the global test can also be interpreted as a test for comparing two different models, namely the cell means and the single means model, we have yet another approach in R. We can use the function anova to compare the two models.

## Fit single mean model:
fit.single <- aov(weight ~ 1, data = PlantGrowth) ## 1 means global mean (intercept)
## Compare with cell means model:
anova(fit.single, fit)
## Analysis of Variance Table
## Model 1: weight ~ 1
## Model 2: weight ~ group
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     29 14.258                              
## 2     27 10.492  2    3.7663 4.8461 0.01591 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, we get the same p-value. We would also get the same p-value if we would use another side-constraint (if we would use the fit2 object).

To perform statistical inference for the individual \(\alpha_i\)’s we can use the commands summary.lm for tests and confint for confidence intervals.

## Call:
## aov(formula = weight ~ group, data = PlantGrowth)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0710 -0.4180 -0.0060  0.2627  1.3690 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0320     0.1971  25.527   <2e-16 ***
## grouptrt1    -0.3710     0.2788  -1.331   0.1944    
## grouptrt2     0.4940     0.2788   1.772   0.0877 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.6234 on 27 degrees of freedom
## Multiple R-squared:  0.2641, Adjusted R-squared:  0.2096 
## F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
##                   2.5 %    97.5 %
## (Intercept)  4.62752600 5.4364740
## grouptrt1   -0.94301261 0.2010126
## grouptrt2   -0.07801261 1.0660126

Interpretation of the output highly depends on the side-constraint that is being used. E.g., the confidence interval \([-0.943, 0.201]\) from the above output is a confidence interval for the difference between trt1 and ctrl (because we used contr.treatment!).

3.1.4 Checking Model Assumptions (Residual Analysis)

Statistical inference (p-values, confidence intervals, …) is only valid if the model assumptions are fulfilled. So far, this means

  • are the errors independent?
  • are the errors normally distributed?
  • is the error variance constant?
  • do the errors have mean zero?

The first assumption is most crucial (but also most difficult to check). If the independence assumption is violated, statistical inference can be very inaccurate. In the ANOVA setting, the last assumption is typically not as important compared to a regression setting, as we are typically fitting “large” models.

We cannot observe the errors \(\epsilon_{ij}\) but we have the residuals \(r_{ij}\) as “estimates”, \[ r_{ij} = y_{ij} - \widehat{\mu}_{i}. \] We now introduce different plots to check these assumptions.


In a QQ-plot we plot the empirical quantiles (“what we see in the data”) vs. the theoretical quantiles (“what we expect from the model”). The plot should show a more or less straight line if the distributional assumption is correct. By default, a standard normal distribution is the theoretical “reference distribution”.

plot(fit, which = 2)

If the QQ-plot suggests non-normality, we can try to use a transformation of the response to accommodate this problem. We can for example use the logarithm or the square root if the residuals are skewed to the right (or any power less than one). If the residuals are skewed to the left we could try a power greater than one.

Note that whenever we transform the response this comes with the prize of a new interpretation of the parameters! Care has to be taken if one wants to interpret statistical inference (e.g., confidence intervals) on the original scale as many aspects do not easily “transform back”!

Tukey-Anscombe Plot (TA-Plot)

The Tukey-Anscombe plot plots the residuals \(r_{ij}\) vs. the fitted values \(\widehat{\mu}_{i}\). It allows us to check whether the residuals have constant variance and whether the residuals have mean zero (i.e. they don’t show any deterministic pattern).

plot(fit, which = 1)

The red line is a smoother (a local average) of the residuals. We could get rid of it by using the function call plot(fit, which = 1, add.smooth = FALSE). For the one-way ANOVA situation we could also read off the same information from the plot of the data itself. In addition, as we model one expected value per group, the model is “saturated” and the residuals automatically always have mean zero (per group).

Quite often we observe a relationship, e.g. a funnel shape, between the variance of the residuals (spread in y-direction) and the mean (on the x-axis). Again, we can try to transform the response to fix this problem, e.g., by transforming the response with the logarithm.

Index Plot

If the data has some serial structure (i.e., if observations were recorded in a certain time order), we typically want to check whether residuals close in time are more similar than residuals far apart, as this would be a violation of the independence assumption. We can do so by using a so-called index plot where we plot the residuals against time. For positively dependendent residuals we would see time periods where most residuals have the same sign, while for negatively dependent residuals, the residuals would “jump” too often from positive to negative compared to independent residuals. Examples can be found in Figure 3.3.

Examples of index plots: independent (top), positively correlated (middle) and negatively correlated residuals (bottom).

Figure 3.3: Examples of index plots: independent (top), positively correlated (middle) and negatively correlated residuals (bottom).

3.1.5 Transformations Affect Interpretation

Whenever we transform the response we implicitly also change the interpretation of the model parameters. Therefore, while it is conceptually attractive to model the problem on an appropriate scale of the response, this typically has the side effect of making interpretation (much) more difficult.

For example, if we use the logarithm, \[ \log(Y_{ij}) = \mu + \alpha_i + \epsilon_{ij} \] all the \(\alpha_i\)’s (and their estimates) have to be interpreted on the log-scale. Say we use contr.treatment where the first group is the reference group and we have \(\widehat{\alpha}_2 = 1.5\). This means: on the log-scale we estimate that the average value of group 2 is \(1.5\) larger than the average value of group 1 (additive shift). What about the effect on the original scale? On the log-scale we had \(\mathbb{E}[\log(Y_{ij})] = \mu + \alpha_{i}\). What does this tell us about \(\mathbb{E}[Y_{ij}]\), the expected value on the original scale? We don’t know as the expected value does (in general) not directly follow the transformation, that is \[ \mathbb{E}[Y_{ij}] \neq e^{\mu + \alpha_{i}}. \] However, we can make a statement about the median. On the log-scale the median is equal to the mean (because we have a symmetric distribution around \(\mu + \alpha_i\)). Hence, \[ \mathrm{median}(\log(Y_{ij})) = \mu + \alpha_i. \] In contrast to the mean, any quantile directly transforms with a strictly monotone increasing function. As the median is nothing else than the 50%-quantile, we have \[ \mathrm{median}(Y_{ij}) = e^{\mu + \alpha_i}. \] Similarly, for the ratio \[ \frac{\mathrm{median}(Y_{2j})}{\mathrm{median}(Y_{1j})} = \frac{e^{\mu + \alpha_2}}{e^{\mu}} = e^{\alpha_2}. \] Hence, we can make a statement that on the original scale the median of group 2 is \(e^{\widehat{\alpha}_2} = e^{1.5} = 4.48\) as large as the median of group 1. This means that additive effects on the log-scale become multiplicative effects on the original scale. Unfortunately, the statement is only about the median and not the mean on the original scale.

If we also consider a confidence interval for \(\alpha_2\), say \([1.2, 1.8]\), the transformed version \([e^{1.2}, e^{1.8}]\) is a confidence interval for \(e^{\alpha_2}\) which is the ratio of medians on the original scale, that is for \[ \frac{\mathrm{median}(Y_{2j})}{\mathrm{median}(Y_{1j})}. \]

3.2 Appendix: Connection to Regression

We can also write equation (3.4) in matrix notation \[ Y = X \beta + E, \] where \(Y \in \mathbb{R}^N\) contains all the response values, \(X\) is an \(N \times g\) matrix, the so-called design matrix , \(\beta \in \mathbb{R}^g\) contains all parameters and \(E \in \mathbb{R}^N\) are the error terms. A row of the design matrix corresponds to an individual observation, while a column corresponds to a parameter. This is a typical setup for any regression problem.

We use a subset of the PlantGrowth data set (only the first two observations per group) for illustrational reasons.

PlantGrowth[c(1, 2, 11, 12, 21, 22),]
##    weight group
## 1    4.17  ctrl
## 2    5.58  ctrl
## 11   4.81  trt1
## 12   4.17  trt1
## 21   6.31  trt2
## 22   5.12  trt2

Hence, for the response vector \(Y\) we have \[ Y = \left( \begin{array}{cc} 4.17 \\ 5.58 \\ 4.81 \\ 4.17 \\ 6.31 \\ 5.12 \end{array} \right). \] How do the design matrix \(X\) and the parameter vector \(\beta\) look like? This depends on the side-constraint on the \(\alpha_i\)’s. If we use contr.treatment we get

\[\begin{equation*} X = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ \end{array} \right), \qquad \beta = \left( \begin{array}{cc} \mu \\ \alpha_2 \\ \alpha_3 \end{array} \right) \end{equation*}\] because the group ctrl is the reference level (hence \(\alpha_1 = 0\)). For contr.sum we have

\[\begin{equation*} X = \left( \begin{array}{rrr} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & -1 & -1 \\ 1 & -1 & -1 \\ \end{array} \right), \qquad \beta = \left( \begin{array}{cc} \mu \\ \alpha_1 \\ \alpha_2 \end{array} \right) \end{equation*}\] because \(\alpha_3 = -(\alpha_1 + \alpha_2)\).

Hence, what we do is nothing else than fitting a linear regression model with a categorical predictor. The categorical predictor is being represented by a set of dummy variables.