Chapter 5 Factorial Treatment Structure

In the completely randomized designs that we have seen so far, the \(g\) different treatments had no special “structure”. In practice, treatments are often combinations of the levels of two or more factors. Think for example of a plant experiment using combinations of light exposure and fertilizer. We call this a factorial treatment structure. If we see all possible combinations of the levels of two (or more) factors, we call them crossed. An illustration of two crossed factors can be found in Table 5.1.

Table 5.1: Example of two Crossed Factors. With “x” we mean that we have data for the specific combination of exposure level (low / medium / high) and fertilizer brand (\(A\) / \(B\)).
exposure / fertilizer \(A\) \(B\)
low x x
medium x x
high x x

Let us have a look at the cheddar data set (Example 8.6 in Oehlert (2010)). An experiment was performed how intentionally added bacteria affect cheese quality. Two strains of bacteria, “R50#10” and “R21#2” were investigated. None (control treatment), both or one of the strains were being added (R50 with levels "yes" and "no" and R21 with levels "yes" and "no"). Total free amino acids was measured as response (acids). The data is available in Table 5.2.

## Create data (skip if not interested) ####
acids <- c(1.697, 1.601, 1.830,
           2.032, 2.017, 2.409,
           2.211, 1.673, 1.973, 
           2.091, 2.255, 2.987)
R50 <- rep(c("no", "yes", "no", "yes"), each = 3)
R21 <- rep(c("no", "no", "yes", "yes"), each = 3)
cheddar <- data.frame(R50, R21, acids)
Table 5.2: Cheddar data set.
R50 R21 acids
no no 1.697
no no 1.601
no no 1.830
R50 R21 acids
yes no 2.032
yes no 2.017
yes no 2.409
R50 R21 acids
no yes 2.211
no yes 1.673
no yes 1.973
R50 R21 acids
yes yes 2.091
yes yes 2.255
yes yes 2.987

Clearly, R50 and R21 are crossed factors. We have 3 observations for every combination of R50 and R21. With R we can easily count the observations with the function xtabs. The formula ~ R50 + R21 in xtabs reads as “count the number of observations for every combination of the levels of the two factors R50 and R21”.

xtabs(~ R50 + R21, data = cheddar)
##      R21
## R50   no yes
##   no   3   3
##   yes  3   3

With a factorial treatment structure we typically have questions about both factors and/or their possible interplay:

  • “Does the effect of adding R50#10 depend on whether we have added R21#10 or not (or vice versa)?”

If the effects do not depend on the other factor, we could also ask:

  • “What is the effect of adding R50#10 on the expected value of the total free amino acids?”

  • “What is the effect of adding R21#10 on the expected value of the total free amino acids?”

We can visualize such kind of data with a so called interaction plot using the function interaction.plot. For every combination of R50 ("yes" / "no") and R21 ("yes" / "no") we calculate the average value of the response. We use R50 on the x-axis (the first argument, also called x.factor). In addition, settings corresponding to the same level of R21 are connected with lines (argument trace.factor). The role of R50 and R21 can of course also be interchanged. Hence, what factor we use as x.factor or as tace.factor is a question of personal preference.

## elegant way, using the function "with"
with(cheddar, interaction.plot(x.factor = R50, trace.factor = R21, response = acids))
## standard way: interaction.plot(x.factor = cheddar$R50, trace.factor = cheddar$R21,
##                                response = cheddar$acids)

We see that the lines are (almost) parallel here. This means that it looks like the effect of R50 does not depend on the level of R21 (and vice versa). Adding “R50#10” seems to increase total free amino acids by about 0.5, while adding “R21#10” has an effect of about 0.3 on total free amino acids. One disadvantage of an interaction plot is the fact that it only plots averages and we do not see the underlying variation of the indidivual observations anymore. Hence, we cannot easily conclude that the effects which we observe in the plot are statistically significant.

A more sophisticated version could be plotted with the add-on package ggplot2 (Wickham et al. 2020) where we plot both the individual observations and the lines (for interested readers only).

library(ggplot2)
ggplot(cheddar, aes(x = R50, y = acids, color = R21)) + geom_point() +
  stat_summary(fun.y = mean, geom = "line", aes(group = R21), size = 1) + theme_bw()

With this plot we immediately see that the effects are rather small compared to the variation of the individual observations. If we have many observations we could also use boxplots for every setting instead of the individual observations.

Let us now set up a parametric model which allows us to make a statement about the uncertainty of the estimated effects.

5.1 Two-Way ANOVA Model

We assume a general setup with a factor \(A\) with \(a\) levels, a factor \(B\) with \(b\) levels and \(n\) replicates for every combination of \(A\) and \(B\) (a balanced design). Hence, we have a total of \(N = a \cdot b \cdot n\) observations. In the previous example we had \(a = 2\), \(b = 2\), \(n = 3\) and therefore \(N = 12\).

The two-way ANOVA model with interaction is \[ Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \epsilon_{ijk} \] where

  • \(\alpha_i\) is the main effect of factor \(A\) at level \(i\)
  • \(\beta_j\) is the main effect of factor \(B\) at level \(j\)
  • \((\alpha \beta)_{ij}\) is the interaction effect between \(A\) and \(B\) for the level combination \(i,j\) (it is not the product \(\alpha_i \beta_j\))
  • \(\epsilon_{ijk}\) are i.i.d. \(N(0, \sigma^2)\) errors.

Typically, we use sum-to-zero side constraints, i.e. \(\sum_{i=1}^a \alpha_i = 0\), \(\sum_{j=1}^b \beta_j = 0\), \(\sum_{i=1}^a (\alpha \beta)_{ij} = 0\) and \(\sum_{j=1}^b (\alpha \beta)_{ij} = 0\). Other choices are also possible. Hence, the main effects have \(a - 1\) and \(b - 1\) degrees of freedom, respectively. The degrees of freedom of the interaction term are \((a- 1)(b - 1)\) which is the product of the degrees of freedom of the involved main effects.

The effects can be interpreted as follows: Think of main effects as “average effects” on the expected value of the response when changing the level of a factor (keeping the other factor fixed). The interaction effect can be thought of as a correction factor to the main effects model. The interaction effect tells us how much an effect of a certain factor changes, when we “switch” the level of the other factor. Strictly speaking, interpretation depends on the side-constraint that we apply. An illustration of the model can be found in Figure 5.1. It is nothing else than the “theoretical” version of the interaction plot where we have the expected value instead of the sample average on the y-axis.

Illustration of factorial model.

Figure 5.1: Illustration of factorial model.

A model without interaction term is additive. This means that the effect of \(A\) does not depend on the level of \(B\) (and vice versa), “it is always the same, no matter what the level of the other factor”. An illustration can be found in Figure 5.2, where we see that the lines are parallel. This means, changing \(B\) from level 1 to level 2 has always the same effect (here: 2). Similarly, the effect of changing \(A\) does not depend on the level of \(B\) (here: 1 and 3).

Illustration of factorial model without interaction.

Figure 5.2: Illustration of factorial model without interaction.

5.1.1 Parameter Estimation

As before, we estimate parameters using the principles of least squares. Using sum-to-zero side constraints we get

Parameter Estimate
\(\mu\) \(\widehat{\mu} = \overline{y}_{\cdot\cdot\cdot}\)
\(\alpha_i\) \(\widehat{\alpha}_i = \overline{y}_{i\cdot\cdot} - \overline{y}_{\cdot\cdot\cdot}\)
\(\beta_j\) \(\widehat{\beta}_j = \overline{y}_{\cdot j \cdot} - \overline{y}_{\cdot\cdot\cdot}\)
\((\alpha \beta)_{ij}\) \(\widehat{(\alpha \beta)}_{ij} = \overline{y}_{ij\cdot} - \widehat{\mu} - \widehat{\alpha}_i - \widehat{\beta}_j\)

As before: if we replace an index with a dot, we take the mean (or the sum) over that “dimension”. For example, \(\overline{y}_{1\cdot\cdot}\) is the mean over all observations \(y_{1jk}\), \(j = 1, \ldots, b\) and \(k = 1, \ldots, n\). Hence, we estimate the expected value of the response \(Y_{ijk}\) for \(A\) at level \(i\) and \(B\) at level \(j\) as \[ \widehat{\mu} + \widehat{\alpha}_i + \widehat{\beta}_j + \widehat{(\alpha \beta)}_{ij} = \overline{y}_{i j \cdot} \] which is nothing else than the mean of the observations in the corresponding “cell” (which is no surprise). Note however that we “untangled” the effect with respect to the two main effects and the interaction.

If we carefully inspect the parameter estimates from above, we see that for the main effects, we use an estimate that completely ignores the other factor. We basically treat the problem as a one-way ANOVA model. This is a consequence of the balanced design. For all levels of \(A\) we have the “same population” of \(B\) settings. Hence, if we compare \(\overline{y}_{1\cdot\cdot}\) (think of taking the average over the first row) with \(\overline{y}_{2\cdot\cdot}\) (average over second row) the effect is only due to changing \(A\) from level 1 to level 2. In regression terminology we would call this an orthogonal design.

In R we fit the model using the function aov. The model formula for the model including the interaction is acids ~ R50 * R21 which is equivalent to acids ~ R50 + R21 + R50:R21. This means whenever we combine two predictors with *, the corresponding main effects are automatically included (which is typically a reasonable approach). The formula acids ~ R50 + R21 would fit a main effects model.

options(contrasts = c("contr.sum", "contr.poly"))
fit.cheddar <- aov(acids ~ R50 * R21, data = cheddar)
coef(fit.cheddar)
## (Intercept)        R501        R211   R501:R211 
##  2.06466667 -0.23383333 -0.13366667  0.01216667
dummy.coef(fit.cheddar)
## Full coefficients are 
##                           
## (Intercept):      2.064667
## R50:            -0.2338333
## R21:            -0.1336667
## R50:R21:        0.01216667

5.1.2 Tests

As for the one-way ANOVA case, the total sum of squares \(SS_T\) can be partioned into different sources \[ SS_T = SS_A + SS_B + SS_{AB} + SS_E, \] where

Source Sum of squares Comment
\(A\) \(SS_A = \sum_{i=1}^a b n (\widehat{\alpha}_i)^2\) “between rows”
\(B\) \(SS_B = \sum_{j=1}^b a n (\widehat{\beta}_j)^2\) “between columns”
\(AB\) \(SS_{AB} = \sum_{i=1}^a \sum_{j=1}^b n (\widehat{\alpha\beta})_{ij}^2\) “correction”
Error \(SS_E = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^n (y_{ijk} - \overline{y}_{ij\cdot})^2\) “within cells”
Total \(SS_T = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^n (y_{ijk} - \overline{y}_{\cdot\cdot\cdot})^2\) “total”

The degrees of freedom of the error term is nothing else than the degrees of freedom of the total sum of squares (\(abn - 1\), number of observations minus 1) minus the degrees of freedom of the other effects.

We can construct an ANOVA table based on this information

Source df Sum of squares Mean squares \(F\)-ratio
\(A\) \(a - 1\) \(SS_A\) \(MS_A = \frac{SS_A}{a-1}\) \(\frac{MS_A}{MS_E}\)
\(B\) \(b - 1\) \(SS_B\) \(MS_B = \frac{SS_B}{b-1}\) \(\frac{MS_B}{MS_E}\)
\(AB\) \((a-1)(b-1)\) \(SS_{AB}\) \(MS_{AB} = \frac{SS_{AB}}{(a-1)(b-1)}\) \(\frac{MS_{AB}}{MS_E}\)
Error \(ab(n-1)\) \(SS_E\) \(MS_E = \frac{SS_E}{ab(n-1)}\)

As before we can construct tests based on the corresponding \(F\)-distributions.

  • interaction
    • \(H_0: (\alpha \beta)_{ij} = 0\) for all \(i,j\)
    • \(H_A\): at least one \((\alpha \beta)_{ij} \neq 0\)
    • Under \(H_0\): \(\frac{MS_{AB}}{MS_E} \sim F_{(a-1)(b-1), \, ab(n-1)}\)
  • main effect A
    • \(H_0: \alpha_i = 0\) for all \(i\)
    • \(H_A\): at least one \(\alpha_i \neq 0\)
    • Under \(H_0\): \(\frac{MS_A}{MS_E} \sim F_{a-1, \, ab(n-1)}\)
  • main effect B
    • \(H_0: \beta_j = 0\) for all \(j\)
    • \(H_A\): at least one \(\beta_j \neq 0\)
    • Under \(H_0\): \(\frac{MS_B}{MS_E} \sim F_{b-1, \, ab(n-1)}\)

Again, we get the ANOVA table and the corresponding p-values with the summary function. We read the following R output from bottom to top. This means that we first check whether we need an interaction term or not. If there is no evidence of interaction, we continue with the inspection of the main effects. If there is evidence of interaction, we start analyzing the effect in more detail (see next example).

summary(fit.cheddar)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## R50          1 0.6561  0.6561   7.234 0.0275 *
## R21          1 0.2144  0.2144   2.364 0.1627  
## R50:R21      1 0.0018  0.0018   0.020 0.8922  
## Residuals    8 0.7257  0.0907                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, there is no evidence of an interaction between R50 and R21. Only the main effect of R50 is significant. This finding is consistent with what we have observed in the interaction plot. There, the two lines were (almost) parallel. Adding R50#10 (significantly) increases the free amino acids. For R21#10 the interaction plot suggests a larger amount of free amino acids if we add it. However, the effect is not significant. Hence it could very well have happened just by chance.

Let us consider another example which is about fecundity of limpets (some aquatic snails) (Quinn and Keough (2010)). We have a factor season with levels spring and summer and a factor density with 3 levels 6, 12 and 24. Density is the number of limpets “sitting” under the same wire mesh guard. For every combination of density and season we have 3 wire mesh guards whose average number of eggs we record. This will be our response variable. Here, an experimental unit is given by a wire mesh guard and not an individual limpet. We interpret density as a factor and not a continuous variable, as this will give us more flexibility.

## Create data (skip if not interested) ####
season  <- factor(rep(c("Spring", "Summer"), each = 6))
density <- factor(rep(c(6, 12, 24), each = 3))
y <- c(1.17, 0.50, 1.67, 1.50, 0.83, 1.00, 0.67, 0.67, 0.75,
       4.00, 3.83, 3.83, 3.33, 2.58, 2.75, 2.54, 1.83, 1.63)
design <- expand.grid(density = factor(c(6, 12, 24)), season = c("Spring", "Summer"))
snails <- data.frame(design[rep(1:nrow(design), each = 3), ], y = y)

## Have a look at interaction plot ####
with(snails, interaction.plot(x.factor = density, trace.factor = season, response = y))

The interaction plot suggests that there might be an interaction between season and density as the effect of density is less pronounced in spring. Let us check whether this is significant or not.

fit.limpets <- aov(y ~ season * density, data = snails)
summary(fit.limpets)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## season          1 17.131  17.131 119.373 1.36e-07 ***
## density         2  4.001   2.001  13.940 0.000742 ***
## season:density  2  1.689   0.845   5.885 0.016552 *  
## Residuals      12  1.722   0.144                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output we conclude that the interaction is significant, i.e. the effect of density is season specific (or the other way round). In such a situation we would (typically) stop inspecting whether the main effects are significant or not.

We could start inspecting some specific questions using appropriate contrasts, e.g., what is the difference between density 6 and 12 in Summer? From a technical point of view, this works best if we construct a so called hyper-factor which has as levels all possible combinations of season and density. In R we can easily do this with the function interaction which “glues together” two factors.

snails[, "combined"] <- interaction(snails[, "season"], snails[, "density"])
levels(snails[, "combined"])
## [1] "Spring.6"  "Summer.6"  "Spring.12" "Summer.12" "Spring.24" "Summer.24"
fit.comb <- aov(y ~ combined, data = snails)

library(multcomp)
## 12 vs. 6 in summer
fit.glht <- glht(fit.comb, linfct = mcp(combined = c(0, -1, 0, 1, 0, 0)))
confint(fit.glht)
## 
##   Simultaneous Confidence Intervals
## 
## Multiple Comparisons of Means: User-defined Contrasts
## 
## 
## Fit: aov(formula = y ~ combined, data = snails)
## 
## Quantile = 2.1788
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##        Estimate lwr     upr    
## 1 == 0 -1.0000  -1.6739 -0.3261

This means that we would conclude that in summer, the difference between density 12 and 6 lies in the interval \([-1.6739, -0.3261]\).

5.1.3 Single Replicates

If we only have a single observation in each “cell” (\(n = 1\)) we cannot do statistical inference anymore with a model including the interaction as we have no idea of the experimental error (for every treatment combination we only have one observation). However, we can still fit a main effects only model. If the data generating mechanism actually contains an interaction, we are fitting a wrong model. The consequences are that the corresponding tests will be too conservative, meaning p-values will be too large. This is not a problem as the type 1 error rate is still controlled. We “just” lose power.

An alternative would be to use a Tukey one-degree of freedom interaction which uses only one additional parameter for the interaction, see Oehlert (2010).

Quite often, we can get rid of interactions if we look at the problem on a different scale, i.e., if we transform the response appropriately. A famous example is the logarithm. Effects that are multiplicative on the original scale become additive on the log-scale, i.e., no interaction is needed on the log-scale.

5.1.4 Checking Model Assumptions

As before we use the QQ-plot and the Tukey-Anscombe plot. We use the previous data as example.

plot(fit.limpets, which = 2)

If in doubt whether this QQ-plot is still OK, we can always compare it to a couple of “nice” (simulated) QQ-plots of the same sample size using the following command (give it a try!)

qqnorm(rnorm(nrow(snails)))

Our QQ-plot looks like one from simulated data. The Tukey-Anscombe plot looks OK too:

plot(fit.limpets, which = 1, add.smooth = FALSE)

The careful reader might wonder why we only see 5 “stripes” instead of 6. The reason is that the two settings Spring, 6 and Spring, 12 have basically the same predicted value \(\widehat{\mu}_{ij}\) (see also the interaction plot).

5.1.5 Unbalanced Data

We started with the (very) strong assumption that our data is balanced, i.e., for every treatment “setting” we have the same number of replicates. This assumptions made our life “easy” in the sense that we could (uniquely) decompose the variability into the different sources and we could estimate the parameters of the coefficients of a factor by ignoring the other factors. In practice, data is typically not balanced. Think for example of a situation where something goes wrong with an experiment in some of the settings. With unbalanced data, the aforementioned properties are (typically) destroyed. We still use least squares to estimate the parameters. The estimates willl look more complicated. However, this is not a problem these days and daily business in most regression problems anyway. More of a problem is the fact that the sum of squares cannot be uniquely partitioned into different sources anymore. This means that for some part of the variation of the data it is not clear to what source we should attribute it.

We “solve” this problem by using an appropriate model comparison approach. As seen before, see for example Equation (3.5), the sum of squares of a factor can be interpreted as the reduction of residual sum of squares when adding the corresponding factor to the model. In the balanced situation it would not matter whether the other factors are in the model or not: the reduction is always the same. Unfortunately, for unbalanced data this property does not hold anymore.

We use the following notation: With SS(B | 1, A) we denote the reduction in residual sum of squares when comparing the model (1, A, B) (= y ~ A + B) with (1, A) (= y ~ A). The 1 is the overall mean \(\mu\) (that does not appear explicitly in the R model formula). Interpretation of the corresponding test is as follows: “Do we need factor B in the model if we already have factor A (or after having controlled for factor A)?”. There are different “ways” or “types” of model comparison approaches.

  • Type I (sequential): Sequentially build up model (depends on the “ordering” of the model terms!)
    • SS(A | 1)
    • SS(B | 1, A)
    • SS(AB | 1, A, B)
  • Type II (hierarchical): Control for the influence of the largest hierarchical model not including the term of interest.
    • SS(A | 1, B)
    • SS(B | 1, A)
    • SS(AB | 1, A, B)
  • Type III (fully adjusted): Control for all other terms.
    • SS(A | 1, B, AB)
    • SS(B | 1, A, AB)
    • SS(AB | 1, A, B)

Type I is what you get with summary of an aov object. Hence, for unbalanced data you get different results whether you write y ~ A * B or y ~ B * A, see also R FAQ 7.18. For type II we can either use the function Anova in the add-on package car or we could compare the appropriate models with the function anova ourselves. For type III we can use the command drop1 (we have to be careful that we set the contrast option to contr.sum in this special situation for technical reasons).

Typically, we take \(MS_E\) from the full model (including all terms) as the estimate for the error standard deviation to construct the corresponding \(F\)-tests.

Note that interpretation of the test depends on the type that we use. This can be seen from the following remarks.

Some Technical Remarks

We use the slightly different notation \[ Y_{ijk} = \mu_{ij} + \epsilon_{ijk} \] where \(\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij}\). This means we use a cell means model with expected values \(\mu_{ij}\). We denote the corresponding sample sizes for the level combination \((i, j)\) with \(n_{ij} > 0\). It can be shown (Speed, Hocking, and Hackney (1978)) that both type I and type II sum of squares lead to tests that involve null hypotheses that depend on the sample sizes of the individual cells (!).

More precisely, for example for the main effect \(A\) we have:

  • For type I sum of squares we implicitly test the null hypothesis \[ H_0: \sum_{l = 1}^b \frac{n_{1l}}{n_{1\cdot}} \mu_{1l} = \cdots = \sum_{l = 1}^b \frac{n_{al}}{n_{a\cdot}} \mu_{al} \] where \(n_{i\cdot}\) is the sum over the corresponding \(n_{ij}\)’s. The above equation simply means that across all the rows, a weighted mean of the \(\mu_{ij}\)’s is the same. Note that the weights depend on the observed sample sizes \(n_{ij}\). Therefore, the observed sample sizes dictate our null hypothesis! Interpretation of such a research question is rather difficult.

  • Similarly for type II sum of squares we get a (more complicated) formula involving \(n_{ij}\)’s.

  • For type III sum of squares we have \[ H_0: \overline{\mu}_{1\cdot} = \cdots = \overline{\mu}_{a\cdot} \] where \(\overline{\mu}_{i\cdot} = \frac{1}{b} \sum_{l = 1}^b \mu_{il}\) which is the unweighted mean of the corresponding \(\mu_{ij}\)’s. This can be reformulated as \[ H_0: \alpha_1 = \ldots = \alpha_a \] This is what we have used so far, and more importantly, does not depend on \(n_{ij}\).

We can also observe that for a balanced design we test the same null hypothesis, no matter what type we use.

Unbalanced Data Example

We have a look at some (simulated) data about a sports experiment using a factorial design with the factor “training method” (method, with levels conventional and new) and the factor “energy drink” (drink, with levels A and B). Response was running time in seconds for a specific track.

running <- read.table("http://stat.ethz.ch/~meier/teaching/data/running.dat", 
                      header = TRUE)
str(running)
## 'data.frame':    70 obs. of  3 variables:
##  $ method: chr  "new" "new" "new" "new" ...
##  $ drink : chr  "A" "A" "A" "A" ...
##  $ y     : num  40.6 49.7 42.1 42.2 39 44.2 44.1 43.1 44.7 46.3 ...
xtabs(~ method + drink, data = running)
##               drink
## method          A  B
##   conventional 10 40
##   new          10 10

Clearly, this is an unbalanced data set. The output of xtabs gives us all the \(n_{ij}\)’s.

We use contr.sum, otherwise type III sum of squares will be wrong (technical issue).

options(contrasts = c("contr.sum", "contr.poly"))
## Type I sum of squares
fit <- aov(y ~ method * drink, data = running)
summary(fit)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## method        1 2024.0  2024.0 263.719  < 2e-16 ***
## drink         1  455.2   455.2  59.316 9.05e-11 ***
## method:drink  1   29.1    29.1   3.791   0.0558 .  
## Residuals    66  506.5     7.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For example, the row method tests the null hypothesis \[ H_0: 0.2 \cdot \mu_{11} + 0.8 \cdot \mu_{12} = 0.5 \cdot \mu_{21} + 0.5 \cdot \mu_{22} \] (because 80% of conventional use drink B).

Now we change the order of the factors in the model formula.

fit2 <- aov(y ~ drink * method, data = running)
summary(fit2) ## sum of squares change!
##              Df Sum Sq Mean Sq F value Pr(>F)    
## drink         1 1145.9  1145.9 149.299 <2e-16 ***
## method        1 1333.4  1333.4 173.737 <2e-16 ***
## drink:method  1   29.1    29.1   3.791 0.0558 .  
## Residuals    66  506.5     7.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the sum of squares depend on the ordering of the model terms in the model formula if we use a type I approach. Hence, we also get different \(F\)-ratios and different p-values. However, the p-values of the main effects stay very small here, no matter what “ordering” we use.

We can easily get type II sum of squares with the add-on package car.

## Type II sum of squares
library(car)
Anova(fit, type = "II", data = running)
## Anova Table (Type II tests)
## 
## Response: y
##               Sum Sq Df  F value    Pr(>F)    
## method       1333.41  1 173.7365 < 2.2e-16 ***
## drink         455.25  1  59.3164 9.053e-11 ***
## method:drink   29.09  1   3.7908   0.05579 .  
## Residuals     506.54 66                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We could also use the Anova function for type III sum of squares, but drop1 will do the job too.

## Type III sum of squares
drop1(fit, scope = ~., test = "F", data = running)
## or: Anova(fit, type = "III", data = running)
## Single term deletions
## 
## Model:
## y ~ method * drink
##              Df Sum of Sq     RSS    AIC  F value    Pr(>F)    
## <none>                     506.54 146.54                       
## method        1   1352.40 1858.95 235.55 176.2110 < 2.2e-16 ***
## drink         1    484.24  990.78 191.50  63.0935 3.347e-11 ***
## method:drink  1     29.09  535.64 148.45   3.7908   0.05579 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now the row method tests the null hypothesis \[ H_0: \frac{1}{2} \cdot \mu_{11} + \frac{1}{2} \cdot \mu_{12} = \frac{1}{2} \cdot \mu_{21} + \frac{1}{2} \cdot \mu_{22}, \] which is nothing else than saying that the “row-averages” of the \(\mu_{ij}\)’s is constant. Here, the actual sample sizes don’t play a role anymore.

5.2 More Than Two Factors

We can easily extend the model to more than two factors. If we have three factors A, B and C, we have \(3\) main effects, \(3 \cdot 2 / 2 = 3\) two-way interactions and one so-called three-way interaction. We omit the mathematical details and work directly with the corresponding R Code. In R we would write y ~ A * B * C for a model including all main effects, all two-way interactions and a three-way interaction. An equivalent formulation would be y ~ A + B + C + A:B + A:C + B:C + A:B:C.

Main effects are interpreted as average effects, two-way interaction effects are interpreted as deviations from the main effects model (i.e., the correction for an effect that depends on the level of the other factor) and the three-way interaction is an adjustment of the two-way interaction depending on the third factor (quite hard to interpret in practice).

In a balanced design, estimation is as for a two-way ANOVA model by ignoring the remaining factor. To estimate the three-way interaction, we simply subtract all estimates of the main effects and the two-way interactions. The degrees of freedom for the three-way interaction is the product of the degrees of freedom of all involved factors.

Consider Problem 8.6 in Oehlert (2010): “A study looked into the management of various tropical grasses for improved production, measured as dry matter yield in hundreds of pounds per acre over a 54-week study period. The management variables were height of cut (1, 3, or 6 inches), the cutting interval (1, 3, 6, or 9 weeks), and amount of nitrogen fertilizer (0, 8, 16, or 32 hundred pounds of ammonium sulfate per acre per year). Forty-eight plots were assigned in completely randomized fashion to the 48 factor-level combinations.”

We first read in the data, make sure that all variables are factors in R and set the labels of the levels accordingly.

grasses <- read.table("http://users.stat.umn.edu/~gary/book/fcdae.data/pr8.6",
                      header = TRUE)
grasses[, "ht"]   <- factor(grasses[, "ht"],   labels = c("1", "3", "6"))
grasses[, "fert"] <- factor(grasses[, "fert"], labels = c("0", "8", "16", "32"))
grasses[, "int"]  <- factor(grasses[, "int"],  labels = c("1", "3", "6", "9"))

Let us check whether this worked as intended.

str(grasses)
## 'data.frame':    48 obs. of  4 variables:
##  $ ht  : Factor w/ 3 levels "1","3","6": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fert: Factor w/ 4 levels "0","8","16","32": 1 1 1 1 2 2 2 2 3 3 ...
##  $ int : Factor w/ 4 levels "1","3","6","9": 1 2 3 4 1 2 3 4 1 2 ...
##  $ y   : num  74.1 65.4 96.7 147.1 87.4 ...

We can visualize the data with (multiple) interaction plots. For example

with(grasses, interaction.plot(x.factor = fert, trace.factor = int, response = y))

We only have one observation for every combination of ht, fert and int.

xtabs(~ ht + fert + int, data = grasses)
## , , int = 1
## 
##    fert
## ht  0 8 16 32
##   1 1 1  1  1
##   3 1 1  1  1
##   6 1 1  1  1
## 
## , , int = 3
## 
##    fert
## ht  0 8 16 32
##   1 1 1  1  1
##   3 1 1  1  1
##   6 1 1  1  1
## 
## , , int = 6
## 
##    fert
## ht  0 8 16 32
##   1 1 1  1  1
##   3 1 1  1  1
##   6 1 1  1  1
## 
## , , int = 9
## 
##    fert
## ht  0 8 16 32
##   1 1 1  1  1
##   3 1 1  1  1
##   6 1 1  1  1

Hence, we cannot do statistical inference for the full model (including the three-way interaction) as we have no replicates. However, we can still fit a model including all two-way interactions. In R, such a model can be easily specified with the formula y ~ .^2 where . stands for “all terms”.

fit.grasses <- aov(y ~ .^2, data = grasses)
drop1(fit.grasses, scope = ~ ., test = "F") ## could also use summary (balanced design)
## Single term deletions
## 
## Model:
## y ~ (ht + fert + int)^2
##          Df Sum of Sq   RSS    AIC  F value    Pr(>F)    
## <none>                 3155 260.90                       
## ht        2        29  3184 257.34   0.0830   0.92068    
## fert      3     42072 45226 382.72  80.0153 1.334e-10 ***
## int       3     73887 77042 408.28 140.5241 1.120e-12 ***
## ht:fert   6       406  3561 254.71   0.3860   0.87835    
## ht:int    6      3005  6160 281.02   2.8578   0.03903 *  
## fert:int  9      5352  8506 290.51   3.3927   0.01313 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is evidence that both the effects of “height of cut” (ht) and “amount of fertilizer” (fert) depend on the level of “cutting interval” (int), as both the ht:int and fert:int interactions are significant.

The residuals look OK for this data set.

plot(fit.grasses, which = 2)

plot(fit.grasses, which = 1, add.smooth = FALSE)

5.2.1 FAQ

When can I test a certain interaction?

If multiple observations are available for each combination of the levels of the involved factors we can test the interaction. We do not have to consider the remaining factors in the model. For example, if we have three factors \(A\), \(B\) and \(C\) we can do statistical inference for the interaction between \(A\) and \(B\), if we have multiple observations for each combination of the levels of \(A\) and \(B\) even if they correspond to different levels of \(C\).

I have a factor with 3 levels and one with 4 levels and 2 replicates for each level combination. Is this enough data for a two-way ANOVA?

This depends on the degrees of freedom of the error term. We have a total of \(3 \cdot 4 \cdot 2 - 1 = 23\) degrees of freedom. The main effects need \(2\) and \(3\) degrees of freedom, respectively, and the interaction \(2 \cdot 3 = 6\) degrees of freedom. Hence, the error has \(23 - 2 - 3 - 6 = 12\) degrees of freedom. According to the classical rule, this is enough. For a more thorough analysis, a detailed power calculation should be performed.

Bibliography

Oehlert, Gary. 2010. A First Course in Design and Analysis of Experiments. http://users.stat.umn.edu/~gary/book/fcdae.pdf.

Quinn, Gerry P., and Michael J. Keough. 2010. Experimental Design and Data Analysis for Biologists. Cambridge University Press.

Speed, M, R Hocking, and P Hackney. 1978. “Methods of Analysis of Linear Models with Unbalanced Data.” Journal of the American Statistical Association 73 (361): 105–12. https://doi.org/10.1080/01621459.1978.10480012.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2020. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.